Life Expectancy (WHO)
Introduction and Problem Description¶
Life expectancy is one of the most important indicators of quality of life and of a society's level of development. It reflects not only the health of the population but also the level of economic development, education, access to medical care, hygiene conditions, nutrition, political stability, and many other factors.
Today, thanks to the large amounts of available data and advances in machine learning, it is possible to analyze the factors that affect life expectancy and to model the relationships among them with a range of approaches.
In this work we apply data analysis methods and regression models to investigate which factors have the strongest influence on life expectancy and how accurately its value can be predicted from the available socio-economic and health indicators. Using techniques such as regularization, feature selection, and model evaluation on training and test sets, the goal is a robust and interpretable model that not only predicts but also explains patterns in the data.
Problem Description¶
From a machine learning perspective, predicting life expectancy is a regression task. The goal is to predict a population's life expectancy from known characteristics of a country, such as the adult mortality rate (Adult Mortality), GDP, level of education, immunization rates, prevalence of disease, health expenditure, and other indicators.
The dataset covers many countries over several years and contains a mix of numerical and categorical variables. This structure introduces several challenges:
High dimensionality: a large number of potential predictors can lead to overfitting, so careful feature selection is necessary.
Multicollinearity: some socio-economic indicators are strongly correlated with each other, which can destabilize classical regression models.
Different scales and distributions: some variables are strongly skewed and contain extreme values, so a transformation (e.g. a log transformation) is needed.
Difference between developed and developing countries: the data show a clear structural split, which can affect the interpretation of the model and the stability of its coefficients.
The motivation for this work is not only to build a model with optimal metrics, but also to understand the structure of the data and identify the factors that contribute most to a longer life. By analyzing coefficients and variable significance, and by comparing different models (including regularized approaches such as Lasso and Ridge regression), we gain insight into how health, economic, and social factors shape the longevity of a population.
The ultimate goal of the project is to build a model that explains as much of the variability in life expectancy as possible.
Initial Setup¶
!pip install pandas
!pip install numpy
!pip install seaborn
!pip install scipy
!pip install requests
!pip install scikit-learn
!pip install statsmodels
!pip install matplotlib
!pip install xgboost
import math

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import requests
import scipy.stats as stats
import seaborn as sns
import statsmodels.api as sm
from scipy.stats import (
    chi2_contingency, f_oneway, kruskal, mannwhitneyu, shapiro, spearmanr, ttest_ind,
)
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import KNNImputer
from sklearn.linear_model import Lasso, LassoCV, LinearRegression, Ridge, RidgeCV
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from statsmodels.stats.outliers_influence import variance_inflation_factor
from xgboost import XGBRegressor
Loading the Data¶
dataframe = pd.read_csv("life_expectancy_data.csv")
dataframe.columns = dataframe.columns.str.strip()
dataframe.head()
| | Country | Year | Status | Life expectancy | Adult Mortality | infant deaths | Alcohol | percentage expenditure | Hepatitis B | Measles | ... | Polio | Total expenditure | Diphtheria | HIV/AIDS | GDP | Population | thinness 10-19 years | thinness 5-9 years | Income composition of resources | Schooling |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 2015 | Developing | 65.0 | 263.0 | 62 | 0.01 | 71.279624 | 65.0 | 1154 | ... | 6.0 | 8.16 | 65.0 | 0.1 | 584.259210 | 33736494.0 | 17.2 | 17.3 | 0.479 | 10.1 |
| 1 | Afghanistan | 2014 | Developing | 59.9 | 271.0 | 64 | 0.01 | 73.523582 | 62.0 | 492 | ... | 58.0 | 8.18 | 62.0 | 0.1 | 612.696514 | 327582.0 | 17.5 | 17.5 | 0.476 | 10.0 |
| 2 | Afghanistan | 2013 | Developing | 59.9 | 268.0 | 66 | 0.01 | 73.219243 | 64.0 | 430 | ... | 62.0 | 8.13 | 64.0 | 0.1 | 631.744976 | 31731688.0 | 17.7 | 17.7 | 0.470 | 9.9 |
| 3 | Afghanistan | 2012 | Developing | 59.5 | 272.0 | 69 | 0.01 | 78.184215 | 67.0 | 2787 | ... | 67.0 | 8.52 | 67.0 | 0.1 | 669.959000 | 3696958.0 | 17.9 | 18.0 | 0.463 | 9.8 |
| 4 | Afghanistan | 2011 | Developing | 59.2 | 275.0 | 71 | 0.01 | 7.097109 | 68.0 | 3013 | ... | 68.0 | 7.87 | 68.0 | 0.1 | 63.537231 | 2978599.0 | 18.2 | 18.2 | 0.454 | 9.5 |
5 rows × 22 columns
dataframe = dataframe.replace(" ",np.nan)
dataframe.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2938 entries, 0 to 2937
Data columns (total 22 columns):
 #   Column                           Non-Null Count  Dtype
---  ------                           --------------  -----
 0   Country                          2938 non-null   object
 1   Year                             2938 non-null   int64
 2   Status                           2938 non-null   object
 3   Life expectancy                  2928 non-null   float64
 4   Adult Mortality                  2928 non-null   float64
 5   infant deaths                    2938 non-null   int64
 6   Alcohol                          2744 non-null   float64
 7   percentage expenditure           2938 non-null   float64
 8   Hepatitis B                      2385 non-null   float64
 9   Measles                          2938 non-null   int64
 10  BMI                              2904 non-null   float64
 11  under-five deaths                2938 non-null   int64
 12  Polio                            2919 non-null   float64
 13  Total expenditure                2712 non-null   float64
 14  Diphtheria                       2919 non-null   float64
 15  HIV/AIDS                         2938 non-null   float64
 16  GDP                              2490 non-null   float64
 17  Population                       2286 non-null   float64
 18  thinness 10-19 years             2904 non-null   float64
 19  thinness 5-9 years               2904 non-null   float64
 20  Income composition of resources  2771 non-null   float64
 21  Schooling                        2775 non-null   float64
dtypes: float64(16), int64(4), object(2)
memory usage: 505.1+ KB
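The non-null counts above can be turned into a per-column summary of missing values. The following is a minimal sketch on a small made-up frame (the `toy` data and the `missing` name are illustrative assumptions, not part of the original notebook); the same two expressions could be applied to `dataframe`.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame standing in for the WHO data.
toy = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "GDP": [584.3, np.nan, 631.7, np.nan],
    "Population": [np.nan, 327582.0, np.nan, np.nan],
})

# Count and share of missing values per column, largest first.
missing = pd.DataFrame({
    "missing": toy.isna().sum(),
    "share": toy.isna().mean().round(2),
}).sort_values("missing", ascending=False)
print(missing)
```

Sorting by the number of missing values makes it easy to see which columns will need imputation or removal first.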
Exploratory Data Analysis¶
Now that the data is loaded, we can start examining it. We begin with a few questions: "Which variable should our model predict?", "How does the target variable depend on the other variables in the dataset?", "How do we describe those dependencies?", and so on. To answer them we rely on the primary method for describing data, Exploratory Data Analysis (EDA). The idea is to use graphical representations to understand how the target variable relates to each of the independent variables. We look for linear dependencies, correlations, and other properties of the predictors that help describe the target, so that we can tell which variables are relevant to the model and which are not.
We start with a general summary of all the data in the dataset.
dataframe.describe(include="all").T
| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Country | 2938 | 193 | Afghanistan | 16 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Year | 2938.0 | NaN | NaN | NaN | 2007.51872 | 4.613841 | 2000.0 | 2004.0 | 2008.0 | 2012.0 | 2015.0 |
| Status | 2938 | 2 | Developing | 2426 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Life expectancy | 2928.0 | NaN | NaN | NaN | 69.224932 | 9.523867 | 36.3 | 63.1 | 72.1 | 75.7 | 89.0 |
| Adult Mortality | 2928.0 | NaN | NaN | NaN | 164.796448 | 124.292079 | 1.0 | 74.0 | 144.0 | 228.0 | 723.0 |
| infant deaths | 2938.0 | NaN | NaN | NaN | 30.303948 | 117.926501 | 0.0 | 0.0 | 3.0 | 22.0 | 1800.0 |
| Alcohol | 2744.0 | NaN | NaN | NaN | 4.602861 | 4.052413 | 0.01 | 0.8775 | 3.755 | 7.7025 | 17.87 |
| percentage expenditure | 2938.0 | NaN | NaN | NaN | 738.251295 | 1987.914858 | 0.0 | 4.685343 | 64.912906 | 441.534144 | 19479.91161 |
| Hepatitis B | 2385.0 | NaN | NaN | NaN | 80.940461 | 25.070016 | 1.0 | 77.0 | 92.0 | 97.0 | 99.0 |
| Measles | 2938.0 | NaN | NaN | NaN | 2419.59224 | 11467.272489 | 0.0 | 0.0 | 17.0 | 360.25 | 212183.0 |
| BMI | 2904.0 | NaN | NaN | NaN | 38.321247 | 20.044034 | 1.0 | 19.3 | 43.5 | 56.2 | 87.3 |
| under-five deaths | 2938.0 | NaN | NaN | NaN | 42.035739 | 160.445548 | 0.0 | 0.0 | 4.0 | 28.0 | 2500.0 |
| Polio | 2919.0 | NaN | NaN | NaN | 82.550188 | 23.428046 | 3.0 | 78.0 | 93.0 | 97.0 | 99.0 |
| Total expenditure | 2712.0 | NaN | NaN | NaN | 5.93819 | 2.49832 | 0.37 | 4.26 | 5.755 | 7.4925 | 17.6 |
| Diphtheria | 2919.0 | NaN | NaN | NaN | 82.324084 | 23.716912 | 2.0 | 78.0 | 93.0 | 97.0 | 99.0 |
| HIV/AIDS | 2938.0 | NaN | NaN | NaN | 1.742103 | 5.077785 | 0.1 | 0.1 | 0.1 | 0.8 | 50.6 |
| GDP | 2490.0 | NaN | NaN | NaN | 7483.158469 | 14270.169342 | 1.68135 | 463.935626 | 1766.947595 | 5910.806335 | 119172.7418 |
| Population | 2286.0 | NaN | NaN | NaN | 12753375.120052 | 61012096.508428 | 34.0 | 195793.25 | 1386542.0 | 7420359.0 | 1293859294.0 |
| thinness 10-19 years | 2904.0 | NaN | NaN | NaN | 4.839704 | 4.420195 | 0.1 | 1.6 | 3.3 | 7.2 | 27.7 |
| thinness 5-9 years | 2904.0 | NaN | NaN | NaN | 4.870317 | 4.508882 | 0.1 | 1.5 | 3.3 | 7.2 | 28.6 |
| Income composition of resources | 2771.0 | NaN | NaN | NaN | 0.627551 | 0.210904 | 0.0 | 0.493 | 0.677 | 0.779 | 0.948 |
| Schooling | 2775.0 | NaN | NaN | NaN | 11.992793 | 3.35892 | 0.0 | 10.1 | 12.3 | 14.3 | 20.7 |
Many cells in this table are NaN, but that only means the statistic does not apply to the column type (for example, unique for a numeric column). The real missing values show up in the count column: several variables have fewer than 2938 non-null values (Population has only 2286, Hepatitis B 2385, GDP 2490), so cleaning this dataset will take some effort. The remaining variables mostly show roughly expected distributions.
At first glance BMI has very odd values: a mean of about 38 and a median of 43.5 at the global level would put the typical population well inside the obese range, which is implausible.
Percentage expenditure also has an implausible mean (about 738), far above 100%.
We will assess these further when we examine each variable individually.
Life Expectancy¶
The Life expectancy variable is the average number of years a newborn is expected to live.
It is one of the most important indicators of a country's overall level of development, because it indirectly reflects the quality of the healthcare system, living standards, level of education, access to clean water and sanitation, nutrition, safety, and socio-economic conditions. Higher values typically indicate that a country is stable and developed, invests heavily in healthcare, has a high GDP per capita, and is not ravaged by infectious diseases. Lower values, on the other hand, are often associated with poverty, infectious diseases, political instability, and weak healthcare infrastructure. The idea is to build a model that predicts this variable from the other socio-economic factors (variables) and thereby explains its level.
dataframe[["Life expectancy"]].describe().T.join(
pd.DataFrame({
"median" : [dataframe["Life expectancy"].median()]
},index=["Life expectancy"])
)
| | count | mean | std | min | 25% | 50% | 75% | max | median |
|---|---|---|---|---|---|---|---|---|---|
| Life expectancy | 2928.0 | 69.224932 | 9.523867 | 36.3 | 63.1 | 72.1 | 75.7 | 89.0 | 72.1 |
Based on the data we are working with, people worldwide live about 69 years on average.
Next we look at the distribution of the target variable. If needed, we can transform it; for example, a logarithmic transformation is appropriate when the distribution is right-skewed.
plt.figure(figsize=(8, 5))
# Drop missing target values before plotting; the y-axis counts records per bin.
plt.hist(dataframe["Life expectancy"].dropna(), bins=30, edgecolor="black", linewidth=1)
plt.xlabel("Life expectancy (years)")
plt.ylabel("Number of records")
plt.title("Distribution of Life Expectancy")
plt.show()
The distribution of Life expectancy is slightly left-skewed, so we can keep it as it is: no transformation is needed, especially since the plot shows no pronounced outliers.
We can now move on to the independent variables.
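The visual impression of skewness can be backed by a number. Below is a minimal sketch on synthetic data (the `sample` array is an assumption made up for illustration, not the WHO column): a negative sample skewness and a mean below the median both indicate a left-skewed distribution.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)

# Synthetic left-skewed sample: a ceiling of 90 minus exponential draws.
sample = 90 - rng.exponential(scale=10, size=5000)

g1 = skew(sample)  # negative for a left-skewed distribution
print(f"skewness = {g1:.2f}")
print(f"mean = {sample.mean():.2f}, median = {np.median(sample):.2f}")
```

On the real column, which contains missing values, `skew(dataframe["Life expectancy"], nan_policy="omit")` would be the corresponding call.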
COUNTRY¶
The Country variable identifies the country.
Let us see how many unique countries the dataset contains.
uniques = dataframe["Country"].nunique()
print("Number of unique countries in the dataset:", uniques)
Number of unique countries in the dataset: 193
Since the number of unique countries is large, we look at the values for only 15 countries.
# Take the 15 countries with the most records (value_counts sorts by count).
top_countries = dataframe["Country"].value_counts().head(15).index
dataframe[dataframe["Country"].isin(top_countries)].boxplot(
    column="Life expectancy",
    by="Country",
    figsize=(10, 6),
    rot=45
)
plt.title("Life Expectancy grouped by Country")
plt.suptitle("")
plt.grid(False)
plt.show()
The plot shows the spread of Life expectancy per country across its recorded years. A few points fall outside the whiskers, which marks them as outliers. Country does not look like a reliable predictor: each country has at most 16 recorded years, and with 193 unique values, encoding it would add considerable complexity to the model.
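To make the encoding concern concrete, here is a minimal sketch of one-hot encoding on a made-up frame (the `toy` data is an illustrative assumption). With `drop_first=True`, a column with k distinct values becomes k - 1 indicator columns, so 193 countries would add 192 columns.

```python
import pandas as pd

# Hypothetical miniature frame with three distinct countries.
toy = pd.DataFrame({
    "Country": ["Afghanistan", "Albania", "Algeria", "Afghanistan"],
    "Year": [2015, 2015, 2015, 2014],
})

# One indicator column per country, minus the first (reference) level.
encoded = pd.get_dummies(toy, columns=["Country"], drop_first=True)
print(encoded.shape)
print(list(encoded.columns))
```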
YEAR¶
The Year variable is the year in which all the factors for a country were recorded. With enough such records we could also forecast a nation's life expectancy for the following year from the data of previous years.
First we draw a boxplot of Life expectancy by Year.
# Every row has a Year, so no filtering is needed before plotting.
dataframe.boxplot(
    column="Life expectancy",
    by="Year",
    figsize=(10, 6),
    rot=45
)
plt.title("Life Expectancy grouped by Year")
plt.suptitle("")
plt.grid(False)
plt.show()
The plot shows that we have records for only 16 years (2000 to 2015), which is not enough to model each country's life expectancy separately, and would be especially hard with plain linear regression. In 2005 there are more outliers, which may point to a war, an epidemic, or a disaster in which more people died than usual.
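The points drawn outside the whiskers follow the usual 1.5 × IQR rule, which is the matplotlib default for boxplots. A minimal sketch of the same rule on made-up values (the `values` series is an illustrative assumption, not data from the notebook):

```python
import pandas as pd

# Hypothetical life expectancy values for one year, with one unusually low record.
values = pd.Series([70.0, 71.0, 72.0, 73.0, 74.0, 75.0, 45.0])

q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Records outside [lower, upper] are the ones a boxplot draws as separate points.
outliers = values[(values < lower) | (values > upper)]
print(lower, upper)
print(outliers.tolist())
```

Applied per year (for example inside a `groupby("Year")`), this would list which countries produce the outliers seen in the plot.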
Status¶
dataframe["Status"].unique()
array(['Developing', 'Developed'], dtype=object)
Status is a categorical variable with two values, "Developing" and "Developed". From domain knowledge, developed countries typically have a higher GDP per capita, better living conditions, a better healthcare system, and greater awareness of the importance of health. In that sense this variable draws a clear line between countries in terms of their socio-economic and developmental characteristics.
We therefore use this categorical variable in all subsequent plots.
# Every row has a Status, so no filtering is needed before plotting.
dataframe.boxplot(
    column="Life expectancy",
    by="Status",
    figsize=(10, 6),
    rot=45
)
plt.title("Life Expectancy grouped by Status")
plt.suptitle("")
plt.grid(False)
plt.show()
The plot supports our domain-knowledge assumption: the box for developed countries is tighter and sits higher on the Life expectancy axis, with a higher median than for developing countries. Status is therefore a strong categorical separator for the target variable.
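The difference between the two groups can also be checked with a non-parametric test. The sketch below uses `mannwhitneyu` on synthetic samples (the group means, spreads, and sizes are assumptions chosen only for illustration); a small p-value supports the claim that values in the first group tend to be larger.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(42)

# Synthetic stand-ins for the two Status groups (not the real data).
developed = rng.normal(loc=79, scale=4, size=200)
developing = rng.normal(loc=67, scale=9, size=800)

# One-sided test: are values in `developed` stochastically greater?
stat, p_value = mannwhitneyu(developed, developing, alternative="greater")
print(f"U = {stat:.0f}, p = {p_value:.3g}")
```

With the real data the two samples would be `dataframe.loc[dataframe["Status"] == "Developed", "Life expectancy"].dropna()` and its "Developing" counterpart.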
Adult Mortality¶
plt.figure(figsize=(10, 8))
# One scatter series per Status value so the groups get different colors.
for status in dataframe["Status"].unique():
    subset = dataframe[dataframe["Status"] == status]
    plt.scatter(
        subset["Adult Mortality"],
        subset["Life expectancy"],
        alpha=0.4,
        label=status
    )
plt.xlabel("Adult Mortality")
plt.ylabel("Life expectancy")
plt.title("Life Expectancy vs Adult Mortality by Status")
plt.legend()
plt.show()
Adult Mortality is the adult mortality rate: the probability of dying between the ages of 15 and 60, per 1000 population.
The plot shows a negative but fairly strong relationship between Adult Mortality and Life expectancy (the higher the Adult Mortality, the lower the Life expectancy).
The outliers (bottom right) may reflect epidemics, wars, disasters, and similar events.
Developed countries clearly cluster in the top-left corner of the plot, which is expected and further underlines the importance of the Status variable.
Although this variable looks like a good predictor, we must not use it for prediction because it constitutes data leakage: Adult Mortality directly describes Life expectancy (it is effectively contained in it). Including it could yield unrealistically high model performance without real predictive value in practice.
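The strength of a monotone relationship like this one can be quantified with Spearman's rank correlation. A minimal sketch on synthetic data (the linear form, coefficients, and noise level are assumptions for illustration only):

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(1)

# Synthetic data with a decreasing trend plus noise (not the WHO values).
adult_mortality = rng.uniform(50, 500, size=300)
life_expectancy = 85 - 0.05 * adult_mortality + rng.normal(0, 3, size=300)

rho, p_value = spearmanr(adult_mortality, life_expectancy)
print(f"rho = {rho:.2f}, p = {p_value:.3g}")
```

A coefficient close to -1 for a predictor that is nearly a restatement of the target is one of the warning signs of leakage discussed above.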
infant deaths¶
plt.figure(figsize=(10, 8))
for status in dataframe["Status"].unique():
    subset = dataframe[dataframe["Status"] == status]
    plt.scatter(
        subset["infant deaths"],
        subset["Life expectancy"],
        alpha=0.4,
        label=status
    )
plt.xlabel("infant deaths")
plt.ylabel("Life expectancy")
plt.title("Life Expectancy vs infant deaths by Status")
plt.legend()
plt.show()
The infant deaths variable gives the number of infant deaths per 1000 population, so values above 1000 must be data errors. This is plausible given that much of this dataset was scraped from the internet and comes from different sources.
As in the previous sections, Status separates life expectancy well.
We can treat this variable as a serious indicator of a population's life expectancy, since a high number of infant deaths goes hand in hand with a very poor healthcare system and low awareness of newborn care. Accordingly, all countries where infant deaths exceeds 200 have very low Life expectancy, as the plot shows.
To see the distribution of this variable properly, we limit the x-axis to 150 infant deaths.
plt.figure(figsize=(10, 8))
for status in dataframe["Status"].unique():
    subset = dataframe[dataframe["Status"] == status]
    plt.scatter(
        subset["infant deaths"],
        subset["Life expectancy"],
        alpha=0.4,
        label=status
    )
# Zoom in on the bulk of the data; larger values are only hidden, not removed.
plt.xlim(0, 150)
plt.xlabel("infant deaths")
plt.ylabel("Life expectancy")
plt.title("Life Expectancy vs infant deaths by Status")
plt.legend()
plt.show()
In this restricted view infant deaths is clearly right-skewed, which makes it a candidate for a logarithmic transformation. Another option is to split the variable into three categories: low, medium, and high. Transforming this variable matters in particular because it dampens the effect of outliers.
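Both options can be sketched in a few lines. The example below uses made-up values (the `infant` series and the bin thresholds 5 and 50 are illustrative assumptions, not choices made in the notebook). `np.log1p` computes log(1 + x), which stays defined for the many zeros in this variable.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed values, including zeros and one extreme record.
infant = pd.Series([0, 0, 1, 3, 5, 12, 22, 60, 150, 1800], name="infant deaths")

# Option 1: log(1 + x) compresses the long right tail.
infant_log = np.log1p(infant)
print(f"skew before = {infant.skew():.2f}, after = {infant_log.skew():.2f}")

# Option 2: three ordered categories with assumed thresholds.
bins = [-1, 5, 50, np.inf]
labels = ["low", "medium", "high"]
infant_cat = pd.cut(infant, bins=bins, labels=labels)
print(infant_cat.value_counts())
```

The transformation keeps the ordering and the numeric scale, while binning discards within-category detail, so the choice depends on how much resolution the model needs.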
ALCOHOL¶
plt.figure(figsize=(10, 8))
for status in dataframe["Status"].unique():
    subset = dataframe[dataframe["Status"] == status]
    plt.scatter(
        subset["Alcohol"],
        subset["Life expectancy"],
        alpha=0.4,
        label=status
    )
plt.xlabel("Alcohol")
plt.ylabel("Life expectancy")
plt.title("Life Expectancy vs Alcohol by Status")
plt.legend()
plt.show()
# Records with very high consumption (15 litres or more), highest first.
filtered_df_alcohol = (
    dataframe.loc[dataframe["Alcohol"] >= 15, ["Country", "Alcohol"]]
    .sort_values(by="Alcohol", ascending=False)
)
filtered_df_alcohol
| | Country | Alcohol |
|---|---|---|
| 874 | Estonia | 17.87 |
| 228 | Belarus | 17.31 |
| 873 | Estonia | 16.99 |
| 875 | Estonia | 16.58 |
| 227 | Belarus | 16.35 |
| 876 | Estonia | 15.52 |
| 1523 | Lithuania | 15.19 |
| 1525 | Lithuania | 15.14 |
| 877 | Estonia | 15.07 |
| 872 | Estonia | 15.04 |
| 1524 | Lithuania | 15.04 |
The Alcohol variable is the recorded alcohol consumption per capita (ages 15+), in litres of pure alcohol.
The plot shows no strong linear relationship between Alcohol and Life expectancy. Because of the right skew, the relationship looks closer to logarithmic, which makes this variable another candidate for a logarithmic transformation. The countries with consumption above 15 litres per capita do not look like outliers: they are Eastern European countries known for high alcohol consumption.
Some countries have high alcohol consumption but are developed and have good healthcare, so they still maintain a solid life expectancy. This suggests that alcohol consumption is linked to a country's level of development; we expect citizens of developed countries to be more aware of how they consume alcohol (smaller amounts but more often, some even daily).
The variable is worth keeping in the analysis of Life expectancy, since together with Status its effect on the target variable is clearly visible.
PERCENTAGE EXPENDITURE¶
plt.figure(figsize=(10, 8))
for status in dataframe["Status"].unique():
    subset = dataframe[dataframe["Status"] == status]
    plt.scatter(
        subset["percentage expenditure"],
        subset["Life expectancy"],
        alpha=0.4,
        label=status
    )
plt.xlabel("percentage expenditure")
plt.ylabel("Life expectancy")
plt.title("Life Expectancy vs percentage expenditure by Status")
plt.legend()
plt.show()
plt.figure(figsize=(10, 8))
for status in dataframe["Status"].unique():
    subset = dataframe[dataframe["Status"] == status]
    plt.scatter(
        subset["percentage expenditure"],
        subset["Life expectancy"],
        alpha=0.4,
        label=status
    )
# Zoom in on values up to 2500; larger values are only hidden, not removed.
plt.xlim(0, 2500)
plt.xlabel("percentage expenditure")
plt.ylabel("Life expectancy")
plt.title("Life Expectancy vs percentage expenditure by Status (<=2500)")
plt.legend()
plt.show()
Percentage expenditure represents health expenditure per capita. We suspect multicollinearity with GDP, so during feature selection it is important to check this with the variance inflation factor (VIF).
A clear vertical band on the left spans the full range of Life expectancy from minimum to maximum, which means that other factors clearly influence Life expectancy as well. At the same time, for values up to about 2500, Life expectancy rises sharply with expenditure, while above roughly 2500 the relationship saturates and we see no further increase.
The outliers here are records with high Percentage expenditure but Life expectancy below 50, well under the global average. They need not be removed, since they may reflect real situations (war, epidemic, ...).
The distribution also suggests that this variable is suitable for a logarithmic transformation, provided it is kept after the multicollinearity check.
Hepatitis B¶
plt.figure(figsize=(10, 8))
for status in dataframe["Status"].unique():
    subset = dataframe[dataframe["Status"] == status]
    plt.scatter(
        subset["Hepatitis B"],
        subset["Life expectancy"],
        alpha=0.4,
        label=status
    )
plt.xlabel("Hepatitis B")
plt.ylabel("Life expectancy")
plt.title("Life Expectancy vs Hepatitis B (%) by Status")
plt.legend()
plt.show()
The Hepatitis B variable is the Hepatitis B immunization coverage among 1-year-olds, in percent.
There is a relationship with Life expectancy, but it is not linear and is fairly scattered (many points combine high immunization with high Life expectancy). The variable could be used as a categorical one, or combined with the other immunization variables into an immunization index.
There are also clear high-leverage points (0-15% and 95-100%). We expect values of 0-15% to characterize poor countries, while developed countries with such values are informative outliers. Some records also have low Life expectancy despite a high immunization percentage, which points to the influence of other factors.
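One simple way to build such an index is the row-wise mean of the immunization coverage columns. The sketch below is an assumption about how the index might be defined (the `immunization_index` name and the choice of an unweighted mean are illustrative); the three column names are taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame with the three immunization columns.
toy = pd.DataFrame({
    "Hepatitis B": [65.0, np.nan, 99.0],
    "Polio": [6.0, 93.0, 98.0],
    "Diphtheria": [65.0, 91.0, 99.0],
})

cols = ["Hepatitis B", "Polio", "Diphtheria"]
# mean(axis=1) skips NaN by default, so a row with one missing value
# is averaged over the remaining columns.
toy["immunization_index"] = toy[cols].mean(axis=1)
print(toy)
```

Because missing values are skipped, the index is still defined for rows where Hepatitis B is absent, which is the column with the most gaps among the three.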
MEASLES¶
plt.figure(figsize=(10, 8))
for status in dataframe["Status"].unique():
    subset = dataframe[dataframe["Status"] == status]
    plt.scatter(
        subset["Measles"],
        subset["Life expectancy"],
        alpha=0.4,
        label=status
    )
plt.xlabel("Measles")
plt.ylabel("Life expectancy")
plt.title("Life Expectancy vs Measles by Status")
plt.legend()
plt.show()
# Records with at least 100 000 reported cases, highest first.
filtered_df_measles = (
    dataframe.loc[dataframe["Measles"] >= 100000, ["Country", "Measles"]]
    .sort_values(by="Measles", ascending=False)
)
filtered_df_measles
| | Country | Measles |
|---|---|---|
| 1908 | Nigeria | 212183 |
| 731 | Democratic Republic of the Congo | 182485 |
| 1907 | Nigeria | 168107 |
| 1905 | Nigeria | 141258 |
| 725 | Democratic Republic of the Congo | 133802 |
| 567 | China | 131441 |
| 570 | China | 124219 |
| 1575 | Malawi | 118712 |
| 1903 | Nigeria | 110927 |
| 568 | China | 109023 |
The Measles variable is described as the number of reported measles cases per 1000 population, although values far above 1000 suggest that it actually holds raw case counts.
Most Measles values cluster near 0, which is expected since most countries report few measles cases. No direct linear relationship between measles cases and Life expectancy is visible.
The extreme values (>100 000) point to measles epidemics. These high-leverage values are plausible for those countries and years, but they say little about living standards, and hence about Life expectancy, because measles epidemics can occur in most parts of the world.
BMI¶
plt.figure(figsize=(10, 8))
for status in dataframe["Status"].unique():
    subset = dataframe[dataframe["Status"] == status]
    plt.scatter(
        subset["BMI"],
        subset["Life expectancy"],
        alpha=0.4,
        label=status
    )
plt.xlabel("BMI")
plt.ylabel("Life expectancy")
plt.title("Life Expectancy vs BMI")
plt.legend()
plt.show()
The BMI variable is the body mass index and is used to describe obesity. BMI is computed by dividing a person's weight in kilograms by the square of their height in meters.
A reasonably linear relationship between BMI and Life expectancy is visible, but a good share of the data are evidently data errors: for many countries the average BMI of the population exceeds 40, which is unrealistic given that countries such as Nauru, American Samoa, and Tokelau, regarded as having the highest BMI values, average around 34. At the country level such values make no sense.
The best decision for this feature would be to drop it entirely.
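As a small illustration of the BMI definition and of the plausibility check described above, the sketch below computes BMI for one person and measures the share of rows whose country-level average exceeds 40. The helper `bmi` and the toy table are made up for illustration; the threshold of 40 is the one used in the text.

```python
import pandas as pd

def bmi(weight_kg, height_m):
    """Body mass index: weight in kilograms divided by squared height in meters."""
    return weight_kg / height_m**2

print(round(bmi(70.0, 1.75), 1))  # 22.9

# Toy country-level averages (illustrative only, not from the dataset).
toy = pd.DataFrame({"Country": list("ABCD"), "BMI": [23.5, 41.2, 58.9, 27.0]})
implausible = toy["BMI"] > 40
print(f"Share of implausible rows: {implausible.mean():.0%}")
```

On the real data the same boolean mask could be applied to `dataframe["BMI"]` to see how widespread the problem is before deciding to drop the column.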
under-five deaths¶
plt.figure(figsize=(10, 8))
for status in dataframe["Status"].unique():
    subset = dataframe[dataframe["Status"] == status]
    plt.scatter(
        subset["under-five deaths"],
        subset["Life expectancy"],
        alpha=0.4,
        label=status
    )
plt.xlabel("under-five deaths")
plt.ylabel("Life expectancy")
plt.title("Life Expectancy vs under-five deaths by Status")
plt.legend()
plt.show()
The under-five deaths variable is the number of deaths among children younger than 5. Since we already have a variable that counts infant deaths, and the distributions of the two variables show that they carry practically identical information, it makes no difference which of the two we choose as a model predictor.
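The visual impression that two variables carry the same information can be backed by a correlation coefficient. The sketch below uses made-up numbers (not the dataset) and computes both the Pearson correlation and a rank-based (Spearman-style) correlation obtained by correlating the ranks; on the real data one would pass the two columns of `dataframe` instead.

```python
import pandas as pd

# Toy values for illustration only.
toy = pd.DataFrame({
    "infant deaths": [0, 2, 10, 55, 120, 400],
    "under-five deaths": [0, 3, 14, 80, 170, 560],
})

# Pearson measures linear association.
pearson = toy["infant deaths"].corr(toy["under-five deaths"])

# Pearson on ranks is the Spearman rank correlation (monotonic association).
ranks = toy.rank()
spearman = ranks["infant deaths"].corr(ranks["under-five deaths"])

print(f"Pearson: {pearson:.3f}, Spearman: {spearman:.3f}")
```

A value close to 1 for both would support keeping only one of the two columns.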
POLIO¶
plt.figure(figsize=(10, 8))
for status in dataframe["Status"].unique():
    subset = dataframe[dataframe["Status"] == status]
    plt.scatter(
        subset["Polio"],
        subset["Life expectancy"],
        alpha=0.4,
        label=status
    )
plt.xlabel("Polio")
plt.ylabel("Life expectancy")
plt.title("Life Expectancy vs Polio by Status")
plt.legend()
plt.show()
The Polio variable is the percentage of 1-year-olds who have been vaccinated.
From the distribution we arrive at practically the same observations as for Hepatitis B.
Because this variable is on the same scale as Hepatitis B, the two can be combined into an immunization index for each country.
TOTAL EXPENDITURE¶
plt.figure(figsize=(10, 8))
for status in dataframe["Status"].unique():
    subset = dataframe[dataframe["Status"] == status]
    plt.scatter(
        subset["Total expenditure"],
        subset["Life expectancy"],
        alpha=0.4,
        label=status
    )
plt.xlabel("Total expenditure")
plt.ylabel("Life expectancy")
plt.title("Life Expectancy vs Total Expenditure by Status")
plt.legend()
plt.show()
Total expenditure is the state's total spending on health, expressed as a percentage.
Total expenditure has a very scattered distribution and is a weak feature on its own. It also contains high-leverage points that have no visible effect on Life expectancy.
Overall, this variable shows no visible predictive power for the problem.
DIPHTHERIA¶
plt.figure(figsize=(10, 8))
for status in dataframe["Status"].unique():
    subset = dataframe[dataframe["Status"] == status]
    plt.scatter(
        subset["Diphtheria"],
        subset["Life expectancy"],
        alpha=0.4,
        label=status
    )
plt.xlabel("Diphtheria")
plt.ylabel("Life expectancy")
plt.title("Life Expectancy vs Diphtheria by Status")
plt.legend()
plt.show()
The Diphtheria variable is the percentage of 1-year-olds who have been vaccinated. We reach the same conclusions as for the other immunization variables (Hepatitis B and Polio).
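Since the three immunization variables share a 0-100% scale, one possible way to build the immunization index mentioned above is a simple row-wise mean. This is only a sketch: the function name `immunization_index`, the new column name, and the choice of an unweighted mean that skips missing values are assumptions, and the toy rows are not from the dataset.

```python
import pandas as pd

IMMUNIZATION_COLUMNS = ["Hepatitis B", "Polio", "Diphtheria"]

def immunization_index(frame, columns=IMMUNIZATION_COLUMNS):
    """Row-wise mean of the coverage columns; NaN only if all of them are missing."""
    return frame[columns].mean(axis=1, skipna=True)

# Toy rows for illustration only.
toy = pd.DataFrame({
    "Hepatitis B": [95.0, None, 10.0],
    "Polio": [97.0, 80.0, 12.0],
    "Diphtheria": [96.0, 82.0, None],
})
toy["Immunization index"] = immunization_index(toy)
print(toy)
```

Skipping missing values keeps rows where only one coverage figure is absent, which matters here because Hepatitis B has a noticeably higher share of missing values than the other two columns.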
HIV/AIDS¶
plt.figure(figsize=(10, 8))
for status in dataframe["Status"].unique():
    subset = dataframe[dataframe["Status"] == status]
    plt.scatter(
        subset["HIV/AIDS"],
        subset["Life expectancy"],
        alpha=0.4,
        label=status
    )
plt.xlabel("HIV/AIDS")
plt.ylabel("Life expectancy")
plt.title("Life Expectancy vs HIV/AIDS by Status")
plt.legend()
plt.show()
The HIV/AIDS variable represents the number of deaths from this disease among children aged 0-4.
The plot immediately shows a solid negative correlation with the target variable. Developed countries show essentially no sign of an HIV epidemic, which suggests that HIV prevalence largely reflects how developed a country's health system is. It is also clear that for countries with values above 1%, life expectancy slowly but steadily declines.
These observations show how large a role diseases can play in estimating life expectancy, since they are often also a representative indicator of a country's health system.
GDP¶
plt.figure(figsize=(10, 8))
for status in dataframe["Status"].unique():
    subset = dataframe[dataframe["Status"] == status]
    plt.scatter(
        subset["GDP"],
        subset["Life expectancy"],
        alpha=0.4,
        label=status
    )
plt.xlabel("GDP")
plt.ylabel("Life expectancy")
plt.title("Life Expectancy vs GDP by Status")
plt.legend()
plt.show()
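GDP (gross domestic product) represents the total domestic income generated by a country. The magnitudes in this dataset (for example about 119 000 for Luxembourg) suggest that the column is expressed per capita.
Because the data are strongly right-skewed, we immediately redraw the plot on a logarithmic scale for easier inspection.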
plt.figure(figsize=(10, 8))
for status in dataframe["Status"].unique():
    subset = dataframe[dataframe["Status"] == status]
    plt.scatter(
        subset["GDP"],
        subset["Life expectancy"],
        alpha=0.4,
        label=status
    )
plt.xscale("log")
plt.xlabel("GDP")
plt.ylabel("Life expectancy")
plt.title("Life Expectancy vs GDP by Status")
plt.legend()
plt.show()
The plot shows a cluster in the upper right corner made up of developed countries, which clearly indicates a relationship with Life expectancy. Although the data for developing countries are scattered across the whole plot, a positive correlation is visible, so this variable has potential predictive power. In any case, the plot shows that developed countries have higher GDP, which usually also signals a commitment to caring for the population through the health system. So although GDP is an economic aspect of a country, it is almost certainly reflected indirectly in its medical aspect. In addition, we can assume that rising GDP improves the quality of a country's infrastructure (environmental measures, clean air, sanitation).
filtered_df_GDP = (
dataframe.loc[dataframe["GDP"] >= 60000,
["Country","GDP"]]
.sort_values(by="GDP", ascending=False)
)
filtered_df_GDP
| | Country | GDP |
|---|---|---|
| 1539 | Luxembourg | 119172.74180 |
| 1542 | Luxembourg | 115761.57700 |
| 1545 | Luxembourg | 114293.84330 |
| 1540 | Luxembourg | 113751.85000 |
| 1547 | Luxembourg | 89739.71170 |
| 2074 | Qatar | 88564.82298 |
| 2525 | Switzerland | 87998.44468 |
| 1915 | Norway | 87646.75346 |
| 2072 | Qatar | 86852.71190 |
| 2075 | Qatar | 85948.74600 |
| 2522 | Switzerland | 85814.58857 |
| 1918 | Norway | 85128.65759 |
| 2523 | Switzerland | 84658.88768 |
| 2524 | Switzerland | 83164.38795 |
| 2078 | Qatar | 82967.37228 |
| 1549 | Luxembourg | 75716.35180 |
| 2526 | Switzerland | 74276.71842 |
| 1919 | Norway | 74114.69715 |
| 2528 | Switzerland | 72119.56870 |
| 2527 | Switzerland | 69672.47100 |
| 1178 | Iceland | 68348.31817 |
| 114 | Australia | 67792.33860 |
| 115 | Australia | 67677.63477 |
| 1920 | Norway | 66775.39440 |
| 2071 | Qatar | 66346.52267 |
| 1550 | Luxembourg | 65445.88530 |
| 744 | Denmark | 64322.66640 |
| 2529 | Switzerland | 63223.46778 |
| 738 | Denmark | 62425.53920 |
| 116 | Australia | 62245.12900 |
| 113 | Australia | 62214.69120 |
| 741 | Denmark | 61753.66700 |
| 2077 | Qatar | 61478.23813 |
| 1258 | Ireland | 61388.17457 |
| 1257 | Ireland | 61235.41500 |
| 739 | Denmark | 61191.19263 |
Looking at GDP values above 60 000, we see that although these points are influential, they are not incorrect data, since Luxembourg's GDP really is that high. Among the mass of low GDP values we are confident that the left-hand column of points contains data errors, but it is natural for most countries to fall below 15 000. We conclude that GDP, together with Status, will play a large role in the predictive model.
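The benefit of viewing GDP on a logarithmic scale can also be expressed numerically by comparing the correlation with life expectancy before and after a log10 transformation. The numbers below are invented for illustration and only mimic the right-skewed shape discussed above; they are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy values for illustration only.
toy = pd.DataFrame({
    "GDP": [300.0, 900.0, 2500.0, 8000.0, 30000.0, 110000.0],
    "Life expectancy": [52.0, 58.0, 65.0, 71.0, 78.0, 82.0],
})

# Pearson correlation on the raw scale and on the log10 scale.
raw_r = toy["GDP"].corr(toy["Life expectancy"])
log_r = np.log10(toy["GDP"]).corr(toy["Life expectancy"])

print(f"raw r = {raw_r:.2f}, log10 r = {log_r:.2f}")
```

If the real data behave similarly, `np.log10(dataframe["GDP"])` would be a more suitable input for a linear model than the raw column.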
POPULATION¶
plt.figure(figsize=(10, 8))
for status in dataframe["Status"].unique():
    subset = dataframe[dataframe["Status"] == status]
    plt.scatter(
        subset["Population"],
        subset["Life expectancy"],
        alpha=0.4,
        label=status
    )
plt.xlabel("Population")
plt.ylabel("Life expectancy")
plt.title("Life Expectancy vs Population by Status")
plt.legend()
plt.show()
plt.figure(figsize=(10, 8))
for status in dataframe["Status"].unique():
    subset = dataframe[dataframe["Status"] == status]
    plt.scatter(
        subset["Population"],
        subset["Life expectancy"],
        alpha=0.4,
        label=status
    )
plt.xlim(0,40000000)
plt.xlabel("Population")
plt.ylabel("Life expectancy")
plt.title("Life Expectancy vs Population by Status")
plt.legend()
plt.show()
filtered_df_population = (
dataframe.loc[dataframe["Population"] >= 1000000000,
["Country", "Population"]]
.sort_values(by="Population", ascending=False)
)
filtered_df_population
| | Country | Population |
|---|---|---|
| 1187 | India | 1.293859e+09 |
| 1194 | India | 1.179681e+09 |
| 1195 | India | 1.161978e+09 |
| 1196 | India | 1.144119e+09 |
| 1197 | India | 1.126136e+09 |
filtered_df_china = dataframe.loc[
dataframe["Country"] == "China",
["Country", "Population"]
]
filtered_df_china
| | Country | Population |
|---|---|---|
| 560 | China | 137122.0 |
| 561 | China | 136427.0 |
| 562 | China | 135738.0 |
| 563 | China | 135695.0 |
| 564 | China | 134413.0 |
| 565 | China | 133775.0 |
| 566 | China | 133126.0 |
| 567 | China | 1324655.0 |
| 568 | China | 1317885.0 |
| 569 | China | 13112.0 |
| 570 | China | 13372.0 |
| 571 | China | 129675.0 |
| 572 | China | 12884.0 |
| 573 | China | 1284.0 |
| 574 | China | 127185.0 |
| 575 | China | 1262645.0 |
The Population variable is the number of inhabitants of a country.
Population clearly has no linear relationship with Life expectancy, so we do not examine it in more depth here. Values above 1 billion are expected for India, but they would be expected for China as well, and China is missing from that list, which is a further inconsistency: the China rows shown above are clearly incorrect.
Apart from that, we cannot identify any correlation with Life expectancy.
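One quick way to surface rows like the China ones is to compare the largest and smallest recorded population within each country. The sketch below uses four of the China values from the table above plus a made-up stable country ("Toyland"); the cutoff of 2 is an assumption, based on the idea that a real population rarely doubles within the period covered.

```python
import pandas as pd

toy = pd.DataFrame({
    "Country": ["Toyland"] * 4 + ["China"] * 4,
    "Population": [
        1_000_000, 1_010_000, 1_020_000, 1_030_000,  # made-up, stable series
        137_122, 1_324_655, 13_112, 1_284,           # values from the China table above
    ],
})

# Ratio of the largest to the smallest recorded value per country.
spread = toy.groupby("Country")["Population"].agg(lambda s: s.max() / s.min())

# Assumed cutoff: a ratio above 2 is treated as suspicious.
suspicious = spread[spread > 2].index.tolist()
print(spread.round(1))
print(suspicious)
```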
thinness 1-19 years¶
plt.figure(figsize=(10, 8))
for status in dataframe["Status"].unique():
    subset = dataframe[dataframe["Status"] == status]
    plt.scatter(
        subset["thinness 10-19 years"],
        subset["Life expectancy"],
        alpha=0.4,
        label=status
    )
plt.xlabel("thinness 1-19 years")
plt.ylabel("Life expectancy")
plt.title("Life Expectancy vs thinness 1-19 years by Status")
plt.legend()
plt.show()
The thinness 1-19 years variable describes the prevalence of thinness among children and adolescents aged 10 to 19, in percent (the column is misnamed, since the range is 10-19 rather than 1-19). It indicates a BMI below the reference values, that is, a lack of nutrients in children's diets.
A moderate negative linear relationship with Life expectancy is visible. The clusters that form lines can be read as the entries of individual countries, each following its own malnutrition trend.
Life expectancy is indeed high for values close to 0, but the vertical column of points that appears throughout indicates the influence of other socio-economic factors on the population's life expectancy. The distribution is also very similar to those of infant deaths and under-five deaths. The cluster formed by developed countries is again clearly visible, which further supports the importance of Status.
THINNESS 5-9 YEARS
plt.figure(figsize=(10, 8))
for status in dataframe["Status"].unique():
    subset = dataframe[dataframe["Status"] == status]
    plt.scatter(
        subset["thinness 5-9 years"],
        subset["Life expectancy"],
        alpha=0.4,
        label=status
    )
plt.xlabel("thinness 5-9 years")
plt.ylabel("Life expectancy")
plt.title("Life Expectancy vs thinness 5-9 years by Status")
plt.legend()
plt.show()
The thinness 5-9 years variable describes the same concept as thinness 1-19 years (that is, 10-19), but for children aged 5 to 9. Comparing the two plots, we conclude that the distributions are practically identical, so it is enough to take either one of them as a predictor. It is especially important not to include both in the model, in order to avoid multicollinearity.
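Multicollinearity between candidate predictors can be quantified with the variance inflation factor, VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the remaining predictors. The sketch below is a plain NumPy implementation on synthetic data that imitates two nearly identical thinness columns; the helper `vif`, the random data, and the usual "VIF above 10" reading are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def vif(frame):
    """Variance inflation factor of every column, regressed on the other columns."""
    values = frame.to_numpy(dtype=float)
    result = {}
    for j, name in enumerate(frame.columns):
        y = values[:, j]
        others = np.delete(values, j, axis=1)
        design = np.column_stack([np.ones(len(y)), others])  # add intercept
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        residual = y - design @ coef
        r2 = 1.0 - residual.var() / y.var()
        result[name] = 1.0 / (1.0 - r2)
    return pd.Series(result)

# Synthetic data: the second column is the first plus small noise.
rng = np.random.default_rng(0)
thin_10_19 = rng.uniform(0, 20, size=200)
toy = pd.DataFrame({
    "thinness 10-19 years": thin_10_19,
    "thinness 5-9 years": thin_10_19 + rng.normal(0, 0.5, size=200),
    "Alcohol": rng.uniform(0, 15, size=200),
})
vif_values = vif(toy)
print(vif_values.round(1))
```

In this toy setup the two thinness columns get very large VIF values while the unrelated column stays near 1, which is the pattern that would justify keeping only one of them.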
INCOME COMPOSITION OF RESOURCES¶
plt.figure(figsize=(10, 8))
for status in dataframe["Status"].unique():
    subset = dataframe[dataframe["Status"] == status]
    plt.scatter(
        subset["Income composition of resources"],
        subset["Life expectancy"],
        alpha=0.4,
        label=status
    )
plt.xlabel("Income composition of resources")
plt.ylabel("Life expectancy")
plt.title("Life Expectancy vs Income composition of resources by Status")
plt.legend()
plt.show()
df_filtered_icr = (
dataframe.loc[dataframe["Income composition of resources"] < 0.1,
["Country","Income composition of resources"]]
.sort_values(by="Income composition of resources", ascending=False)
)
df_filtered_icr
| | Country | Income composition of resources |
|---|---|---|
| 74 | Antigua and Barbuda | 0.0 |
| 2422 | South Sudan | 0.0 |
| 2420 | South Sudan | 0.0 |
| 2419 | South Sudan | 0.0 |
| 2418 | South Sudan | 0.0 |
| ... | ... | ... |
| 860 | Eritrea | 0.0 |
| 849 | Equatorial Guinea | 0.0 |
| 607 | Comoros | 0.0 |
| 606 | Comoros | 0.0 |
| 2857 | Vanuatu | 0.0 |
130 rows × 2 columns
The Income composition of resources variable describes development based on income per capita, normalized between 0 and 1.
For entries where this value equals 0.0, domain knowledge leads us to conclude that they describe very poorly developed countries in complete stagnation, with no sign of progress that could in turn raise life expectancy.
Apart from that, we see a clear and strong positive relationship with Life expectancy, where the high-leverage points reach about 90 years, especially for developed countries.
This variable looks like a safe candidate for feature selection.
SCHOOLING¶
plt.figure(figsize=(10, 8))
for status in dataframe["Status"].unique():
    subset = dataframe[dataframe["Status"] == status]
    plt.scatter(
        subset["Schooling"],
        subset["Life expectancy"],
        alpha=0.4,
        label=status
    )
plt.xlabel("Schooling")
plt.ylabel("Life expectancy")
plt.title("Life Expectancy vs Schooling by Status")
plt.legend()
plt.show()
The Schooling variable is the average number of years of schooling in a country. Compared with Income composition of resources, the distributions are practically identical, and looking at both we conclude that they are a serious candidate for multicollinearity, since the level of schooling directly affects citizens' awareness and therefore what money is invested in, resistance to corruption, and so on.
There is a visible threshold at 10 years, above which life expectancy is very high, supported by the fact that most such countries are developed.
We will therefore consider only Schooling, since although the two variables describe different concepts, they are closely related.
DATA CLEANING¶
We first check for duplicates; the dataset contains none.
dataframe.duplicated().any()
np.False_
The table below shows the percentage of missing values for each feature in the dataset.
(dataframe.isnull().sum()/dataframe.shape[0]*100).round(2)
Country                             0.00
Year                                0.00
Status                              0.00
Life expectancy                     0.34
Adult Mortality                     0.34
infant deaths                       0.00
Alcohol                             6.60
percentage expenditure              0.00
Hepatitis B                        18.82
Measles                             0.00
BMI                                 1.16
under-five deaths                   0.00
Polio                               0.65
Total expenditure                   7.69
Diphtheria                          0.65
HIV/AIDS                            0.00
GDP                                15.25
Population                         22.19
thinness 10-19 years                1.16
thinness 5-9 years                  1.16
Income composition of resources     5.68
Schooling                           5.55
dtype: float64
The same information is shown graphically:
df = (dataframe.isna().mean()*100).round(2)
df = df[df > 0].sort_values()
df.plot(kind="barh", figsize=(8,5), title="Nedostajuće vrednosti (%)")
plt.xlabel("%")
plt.tight_layout()
plt.show()
missing_data = dataframe.columns[dataframe.isna().any()]
miss = dataframe[missing_data].isna().astype(int)
corr = miss.corr()
results = []
for i in corr.columns:
    for j in corr.columns:
        if i < j:
            r = corr.loc[i, j]
            if r > 0.3:
                results.append((i, j, r))
results.sort(key=lambda x: x[2], reverse=True)
print("Korelacija nedostajucih vrednosti")
print("-" * 70)
for i, j, r in results:
    print(f"{i:32} {j:32} r={r:.3f}")
Korelacija nedostajucih vrednosti
----------------------------------------------------------------------
Adult Mortality                  Life expectancy                  r=1.000
BMI                              thinness 10-19 years             r=1.000
BMI                              thinness 5-9 years               r=1.000
Diphtheria                       Polio                            r=1.000
thinness 10-19 years             thinness 5-9 years               r=1.000
Income composition of resources  Schooling                        r=0.987
Alcohol                          Total expenditure                r=0.895
GDP                              Population                       r=0.744
GDP                              Schooling                        r=0.559
GDP                              Income composition of resources  r=0.554
Income composition of resources  Population                       r=0.456
Population                       Schooling                        r=0.454
BMI                              Polio                            r=0.428
BMI                              Diphtheria                       r=0.428
Polio                            thinness 10-19 years             r=0.428
Polio                            thinness 5-9 years               r=0.428
Diphtheria                       thinness 10-19 years             r=0.428
Diphtheria                       thinness 5-9 years               r=0.428
Correlation of missing values¶
This table shows the correlation between missing values. In essence, it shows how often two features lack data in the same rows.
The results show that missing values often come in groups.
BMI, thinness 10-19 years, and thinness 5-9 years have a correlation of r = 1.000. This means that whenever one of them is missing, the others are missing too. We can conclude that they come from the same source.
The same holds for Adult Mortality and Life expectancy, as well as for Diphtheria and Polio, where the missingness also coincides completely. This indicates that these data were probably taken from the same sources.
There is also a strong correlation between Income composition of resources and Schooling (r = 0.987), both of which are socio-economic indicators. These data are possibly missing for the same countries or years.
Pairs such as GDP ↔ Population (r = 0.744) and Alcohol ↔ Total expenditure (r = 0.895) show that economic and financial metrics are often missing together.
Based on this, we can conclude that the missing values in the dataset are not random, but appear in groups of related variables.
fig, axis = plt.subplots(figsize=(9,7))
heatmap = axis.imshow(corr, cmap="RdYlBu_r", vmin=-1, vmax=1)
axis.set_xticks(range(len(corr.columns)))
axis.set_yticks(range(len(corr.columns)))
axis.set_xticklabels(corr.columns, rotation=45, ha="right")
axis.set_yticklabels(corr.columns)
for i in range(len(corr)):
    for j in range(len(corr)):
        axis.text(j, i, round(corr.iloc[i, j], 2),
                  ha="center", va="center", fontsize=7)
plt.colorbar(heatmap)
axis.set_title("Korelacija nedostajućih vrednosti")
plt.show()
The heatmap visualizes the correlation of missing values between features. Darker red cells (closer to 1) mean that two columns are often missing in the same rows, while lighter cells indicate a weaker association in missingness.
The heatmap clearly shows the same groups seen in the table, such as BMI and the thinness variables, as well as Diphtheria and Polio, which have an almost identical missingness pattern. This confirms that certain sets of data are missing together, probably because they come from the same sources.
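Columns with r = 1.000 in the missingness correlation have exactly the same pattern of missing rows, and such groups can be listed directly instead of being read off the heatmap. The sketch below is one possible way to do it; the function name `identical_missingness_groups` and the toy table are illustrative assumptions.

```python
import numpy as np
import pandas as pd

def identical_missingness_groups(frame):
    """Group columns whose missing-value masks are identical row by row."""
    groups = {}
    for column in frame.columns[frame.isna().any()]:
        key = tuple(frame[column].isna())  # the mask itself is the grouping key
        groups.setdefault(key, []).append(column)
    return [cols for cols in groups.values() if len(cols) > 1]

# Toy table for illustration only.
toy = pd.DataFrame({
    "BMI": [20.1, np.nan, 25.3, np.nan],
    "thinness 5-9 years": [1.2, np.nan, 0.8, np.nan],
    "GDP": [500.0, 900.0, np.nan, 1200.0],
    "Schooling": [10.0, 11.0, 12.0, 13.0],
})
groups = identical_missingness_groups(toy)
print(groups)
```

Applied to the full `dataframe`, each returned group is a set of columns that would have to be imputed or dropped together.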
POPULATION
dataframe["Population"].describe()
count    2.286000e+03
mean     1.275338e+07
std      6.101210e+07
min      3.400000e+01
25%      1.957932e+05
50%      1.386542e+06
75%      7.420359e+06
max      1.293859e+09
Name: Population, dtype: float64
Basic statistics – Population¶
The Population feature has a very wide range of values. The minimum is 34, while the maximum reaches 1.29 billion, which shows that the dataset covers both very small states and the most populous countries in the world.
The median population is about 1.39 million, while the mean is much larger (about 12.75 million). This gap indicates strong right skew, because a few very large countries pull the mean up considerably.
The standard deviation is also very high (≈61 million), which further confirms the large variability of population across the countries in the dataset.
plt.hist(dataframe["Population"].dropna(), bins=40, edgecolor="black",linewidth=1)
plt.title("Distribucija populacije")
plt.show()
The histogram shows the distribution of population values in the dataset. The x-axis shows population ranges (in billions, because of the large scale), and the y-axis shows how many records (country–year combinations) fall in each range.
The plot shows pronounced right skew. Most countries have a relatively small population, while a small number of very large countries (e.g. India; China would belong here too, but its values are recorded incorrectly in this dataset) considerably widen the range of values and create a long tail on the right side of the distribution.
pop = pd.to_numeric(dataframe["Population"], errors="coerce")
print(pop.describe())
print("Missing %:", pop.isna().mean()*100)
plt.figure(figsize=(6,4))
plt.hist(np.log10(pop.dropna()), bins=40, edgecolor="black",linewidth=1)
plt.title("Population distribution (log10 scale)")
plt.xlabel("log10(Population)")
plt.ylabel("Count")
plt.show()
count    2.286000e+03
mean     1.275338e+07
std      6.101210e+07
min      3.400000e+01
25%      1.957932e+05
50%      1.386542e+06
75%      7.420359e+06
max      1.293859e+09
Name: Population, dtype: float64
Missing %: 22.19196732471069
Because population spans a very wide range (from a few dozen to more than a billion), an ordinary histogram is hard to read, since a few very large countries dominate the scale.
That is why a log10 transformation is used. It compresses large values and spreads out small ones, so the distribution becomes easier to read. This makes it easier to see how countries are distributed by population size without extremely large countries stretching the plot.
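The effect of the log10 transformation on asymmetry can be checked with the sample skewness before and after transforming. The values below are made up to span a similar range as the Population column (from a few dozen to over a billion); they are not rows from the dataset.

```python
import numpy as np
import pandas as pd

# Toy values for illustration only.
toy_pop = pd.Series(
    [34, 2_000, 195_000, 1_400_000, 7_400_000, 60_000_000, 1_290_000_000],
    dtype=float,
)

raw_skew = toy_pop.skew()            # skewness on the original scale
log_skew = np.log10(toy_pop).skew()  # skewness after log10

print(f"raw skew = {raw_skew:.2f}, log10 skew = {log_skew:.2f}")
```

A skewness near 0 after the transformation indicates a roughly symmetric distribution, which is what makes the log-scale histogram readable.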
bad = (pop <= 0)
print("Broj redova sa vrednoscu 0:", bad.sum())
Broj redova sa vrednoscu 0: 0
miss_by_year = dataframe.groupby("Year")["Population"].apply(lambda s: s.isna().mean())
plt.figure(figsize=(7,3))
plt.plot(miss_by_year.index, miss_by_year.values, marker="o")
plt.title("Population missing rate by year")
plt.xlabel("Year")
plt.ylabel("Missing rate")
plt.show()
miss_by_country = dataframe.groupby("Country")["Population"].apply(lambda s: s.isna().mean()).sort_values(ascending=False)
plt.figure(figsize=(8,4))
miss_by_country.head(20).plot(kind="bar", edgecolor="black", linewidth=1)
plt.title("Raspodela procenta nedostajućih vrednosti populacije po državama")
plt.ylabel("Procenat nedostajućih vrednosti")
plt.show()
full_missing = miss_by_country[miss_by_country == 1.0].index.tolist()
print("Broj država kojima populacija potpuno nedostaje:", len(full_missing))
print("Prvih 30 država:", full_missing[:30])
Broj država kojima populacija potpuno nedostaje: 48
Prvih 30 država: ['Antigua and Barbuda', 'Dominica', 'Barbados', 'Bahrain', 'Bahamas', 'Brunei Darussalam', 'Bolivia (Plurinational State of)', 'Gambia', 'Egypt', 'Democratic Republic of the Congo', 'Cuba', "Côte d'Ivoire", "Democratic People's Republic of Korea", 'Congo', 'Czechia', 'Cook Islands', 'The former Yugoslav republic of Macedonia', 'United States of America', 'United Republic of Tanzania', 'United Kingdom of Great Britain and Northern Ireland', 'Marshall Islands', 'Niue', 'Oman', 'Nauru', 'New Zealand', 'Micronesia (Federated States of)', 'Monaco', 'Kyrgyzstan', 'Kuwait', 'Libya']
For some countries Population is missing in 100% of rows. In those cases we cannot interpolate, because there is no known value in any year. Filling in the mean or median of other countries makes no sense either, because one country's population is unrelated to another's and such imputation would be arbitrary.
The most likely cause is a mismatch in country names when the data were merged (for example different versions of a name such as "Czechia" vs "Czech Republic", "Bolivia (Plurinational State of)", etc.), so the values did not match. We will therefore fill in Population from a second population dataset and merge it with these data.
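A minimal sketch of such a merge is shown below. Everything about the external source is hypothetical: its column names (`country_name`, `year`, `population`), the name mapping `NAME_MAP`, and the placeholder numbers, which are not real statistics. The point is only the mechanics: harmonize names, left-merge on country and year, fill the gaps, and list whatever still did not match.

```python
import pandas as pd

# Hypothetical mapping: name in this dataset -> name in the external source.
NAME_MAP = {
    "Czechia": "Czech Republic",
    "Bolivia (Plurinational State of)": "Bolivia",
}

who = pd.DataFrame({
    "Country": ["Czechia", "Czechia", "Bolivia (Plurinational State of)"],
    "Year": [2014, 2015, 2015],
    "Population": [float("nan")] * 3,
})

# Placeholder external table (made-up numbers, hypothetical column names).
external = pd.DataFrame({
    "country_name": ["Czech Republic", "Czech Republic", "Bolivia"],
    "year": [2014, 2015, 2015],
    "population": [1000.0, 1010.0, 2000.0],
})

who["merge_key"] = who["Country"].replace(NAME_MAP)
merged = who.merge(
    external,
    how="left",
    left_on=["merge_key", "Year"],
    right_on=["country_name", "year"],
    indicator=True,  # adds a _merge column showing which rows found a match
)
merged["Population"] = merged["Population"].fillna(merged["population"])
unmatched = merged.loc[merged["_merge"] == "left_only", "Country"].unique()
result = merged[["Country", "Year", "Population"]]
print(result)
print("Unmatched countries:", list(unmatched))
```

Inspecting `unmatched` after the merge shows which country names still need an entry in the mapping.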
df = dataframe.sort_values(["Country","Year"]).copy()
df["Population"] = pd.to_numeric(df["Population"])
prev = df.groupby("Country")["Population"].shift(1)
df["Population growth"] = (df["Population"] - prev) / prev
extreme = df["Population growth"].abs().sort_values(ascending=False).head(20)
print(df.loc[extreme.index, ["Country","Year","Population","Population growth"]].to_string(index=False))
Country Year Population Population growth
Hungary 2011 9971727.0 81069.951220
Ethiopia 2008 83184892.0 10206.987729
Iraq 2015 36115649.0 10121.098935
Maldives 2015 49163.0 1198.097561
Benin 2005 7982225.0 1028.433196
Cameroon 2009 19432541.0 1022.950943
Burundi 2001 6555829.0 1011.326899
Turkmenistan 2003 4655741.0 1008.484172
Peru 2010 29373646.0 1006.430325
Nicaragua 2002 5171734.0 998.368889
Bosnia and Herzegovina 2008 3763599.0 996.244038
Mali 2013 16477818.0 987.649307
Uruguay 2014 3419546.0 980.218364
Pakistan 2005 15399667.0 974.712285
Syrian Arab Republic 2015 18734987.0 972.802537
Turkey 2009 71339185.0 957.447778
Tajikistan 2007 7152385.0 945.458251
Germany 2015 81686611.0 908.397284
Bhutan 2009 714458.0 897.689308
Chad 2006 1421597.0 845.692674
This computes annual population growth per country relative to the previous year: (pop - previous) / previous.
The output contains extreme values (e.g. 1000x, 10000x...), which is not realistic for a population change within one year. It most likely means that the previous value was entered incorrectly or that a year of data is missing, so the calculation produces a huge jump. We therefore treat these rows as potential data errors and do not take them at face value without further checks.
g = df[df["Country"] == "Hungary"][["Year","Population"]].sort_values("Year")
print(g.to_string(index=False))
g2 = df[df["Country"] == "Hungary"][["Year","Population","Population growth"]].sort_values("Year")
print(g2.to_string(index=False))
Year Population
2000   121971.0
2001  1187576.0
2002   115868.0
2003  1129552.0
2004   117146.0
2005    18765.0
2006    17137.0
2007    15578.0
2008   138188.0
2009    12265.0
2010      123.0
2011  9971727.0
2012   992362.0
2013   989382.0
2014  9866468.0
2015   984328.0
Year Population Population growth
2000   121971.0               NaN
2001  1187576.0          8.736544
2002   115868.0         -0.902433
2003  1129552.0          8.748610
2004   117146.0         -0.896290
2005    18765.0         -0.839815
2006    17137.0         -0.086757
2007    15578.0         -0.090973
2008   138188.0          7.870715
2009    12265.0         -0.911244
2010      123.0         -0.989971
2011  9971727.0      81069.951220
2012   992362.0         -0.900482
2013   989382.0         -0.003003
2014  9866468.0          8.972354
2015   984328.0         -0.900235
df = dataframe.sort_values(["Country","Year"]).copy()
df["Population"] = pd.to_numeric(df["Population"], errors="coerce")
c = "Hungary"
g = df[df["Country"]==c][["Year","Population"]].sort_values("Year")
plt.figure(figsize=(9,4))
plt.plot(g["Year"], g["Population"], marker="o")
plt.title("Hungary — Population by Year (corrupted scale jumps)")
plt.xlabel("Year")
plt.ylabel("Population")
plt.grid(True, alpha=0.3)
plt.show()
for state in ["Hungary","Luxembourg","Maldives","Germany","India"]:
    g = dataframe[dataframe["Country"] == state].sort_values("Year")
    y = g["Population"]
    plt.figure(figsize=(8,3))
    plt.plot(g["Year"], y, marker="o")
    plt.title(f"{state} - Population over time")
    plt.xlabel("Year")
    plt.ylabel("Population")
    plt.grid(True, alpha=0.3)
    plt.show()
These plots look wrong for a population: there are huge jumps and drops to almost zero within a single year, which makes no sense for a real population (a population cannot oscillate like that). This tells us that the data are incorrect or poorly filled in (for example, in some years the recorded value appears to have lost digits, such as 123 for Hungary in 2010, so the next correct value looks like an extreme rise, and the reverse like an extreme fall).
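One simple heuristic for marking such rows is to compare each value with the largest value recorded for the same country. The sketch below uses the 2010-2015 Hungary values printed earlier and assumes that the errors only ever shrink the number (lost digits), so the country maximum is close to the true scale; the factor 0.5 is likewise an assumption, not a rule taken from the data.

```python
import pandas as pd

# Hungary rows for 2010-2015, copied from the output shown earlier.
hungary = pd.DataFrame({
    "Country": "Hungary",
    "Year": [2010, 2011, 2012, 2013, 2014, 2015],
    "Population": [123.0, 9971727.0, 992362.0, 989382.0, 9866468.0, 984328.0],
})

# Largest recorded value per country, aligned to every row.
country_max = hungary.groupby("Country")["Population"].transform("max")

# Assumed rule: anything below half of the country maximum is suspect.
hungary["suspect"] = hungary["Population"] < 0.5 * country_max
print(hungary)
```

Rows marked as suspect could then be set to missing and refilled from a reliable source rather than being corrected by guessing the lost digits.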
df = dataframe.sort_values(["Country","Year"]).copy()
df["Population"] = pd.to_numeric(df["Population"])
prev = df.groupby("Country")["Population"].shift(1)
df["pop_growth"] = (df["Population"] - prev) / prev
top = df[df["pop_growth"].notna()].copy()
top["abs_growth"] = top["pop_growth"].abs()
print(top.sort_values("abs_growth", ascending=False)[["Country","Year","Population","pop_growth"]].to_string(index=False))
Country Year Population pop_growth
Hungary 2011 9.971727e+06 81069.951220
Ethiopia 2008 8.318489e+07 10206.987729
Iraq 2015 3.611565e+07 10121.098935
Maldives 2015 4.916300e+04 1198.097561
Benin 2005 7.982225e+06 1028.433196
Cameroon 2009 1.943254e+07 1022.950943
Burundi 2001 6.555829e+06 1011.326899
Turkmenistan 2003 4.655741e+06 1008.484172
Peru 2010 2.937365e+07 1006.430325
Nicaragua 2002 5.171734e+06 998.368889
Bosnia and Herzegovina 2008 3.763599e+06 996.244038
Mali 2013 1.647782e+07 987.649307
Uruguay 2014 3.419546e+06 980.218364
Pakistan 2005 1.539967e+07 974.712285
Syrian Arab Republic 2015 1.873499e+07 972.802537
Turkey 2009 7.133918e+07 957.447778
Tajikistan 2007 7.152385e+06 945.458251
Germany 2015 8.168661e+07 908.397284
Bhutan 2009 7.144580e+05 897.689308
Chad 2006 1.421597e+06 845.692674
Armenia 2005 2.981259e+06 824.376246
Romania 2013 1.998369e+07 772.512406
Mozambique 2006 2.154746e+07 735.992954
Senegal 2004 1.955944e+06 115.432169
Cyprus 2005 1.276580e+05 110.882559
India 2001 1.714779e+08 110.645625
Morocco 2006 3.869346e+06 108.871539
South Sudan 2008 9.263136e+06 103.587842
Philippines 2015 1.171636e+07 103.378293
Myanmar 2010 5.155896e+06 102.388799
Niger 2003 1.265687e+06 102.220274
South Sudan 2001 6.974442e+06 102.086822
Senegal 2014 1.454611e+07 101.994442
Afghanistan 2015 3.373649e+07 101.986410
Zambia 2011 1.426476e+07 101.970094
Burkina Faso 2003 1.265462e+07 101.940845
Nigeria 2015 1.811817e+08 101.672790
Zambia 2003 1.142198e+07 101.670442
Ghana 2006 2.211342e+07 101.648320
Sudan 2002 2.867956e+07 101.626774
Paraguay 2001 5.466240e+05 101.613854
Senegal 2012 1.373513e+06 101.569860
Nigeria 2001 1.254634e+08 101.542264
Comoros 2014 7.593850e+05 101.412003
Sao Tome and Principe 2012 1.828890e+05 101.286913
Kenya 2015 4.723626e+07 101.149017
Azerbaijan 2009 8.947243e+06 101.097850
Guatemala 2004 1.279692e+07 100.985408
Sudan 2007 3.228253e+07 100.913494
Israel 2003 6.689700e+04 100.821918
Australia 2012 2.272825e+07 100.727003
Haiti 2001 8.692567e+06 100.676964
Central African Republic 2008 4.345386e+06 100.627438
Cambodia 2015 1.551764e+07 100.569162
Bangladesh 2004 1.413749e+07 100.568988
Cambodia 2006 1.347449e+07 100.524921
Algeria 2008 3.486715e+06 100.428758
Kazakhstan 2012 1.679142e+07 100.418317
Zimbabwe 2006 1.312427e+07 100.398935
Greece 2004 1.955141e+06 100.370923
Uzbekistan 2001 2.496445e+06 100.259228
Uzbekistan 2006 2.648825e+06 100.227691
Bangladesh 2007 1.471392e+08 100.218140
Peru 2006 2.794994e+07 100.216205
Jordan 2001 5.193482e+06 100.211818
Luxembourg 2001 4.415250e+05 100.197570
Turkey 2015 7.827147e+07 100.174559
Bhutan 2012 7.529670e+05 100.055831
Cambodia 2011 1.453789e+07 100.045957
South Africa 2005 4.766672e+05 100.031846
Myanmar 2003 4.762489e+07 100.023911
Paraguay 2005 5.795494e+06 100.012549
Canada 2007 3.288793e+07 99.959089
Cabo Verde 2014 5.264370e+05 99.927339
Kenya 2008 3.914842e+07 99.742452
Switzerland 2004 7.389625e+06 99.688436
Zimbabwe 2003 1.263390e+07 99.648452
Kazakhstan 2007 1.548419e+07 99.622495
Mauritius 2005 1.228254e+06 99.569393
Brazil 2003 1.824821e+08 99.534427
China 2007 1.317885e+06 99.509838
Trinidad and Tobago 2011 1.334788e+06 99.503577
El Salvador 2003 5.971535e+06 99.475073
Thailand 2012 6.784398e+07 99.463013
Netherlands 2015 1.693992e+07 99.439487
El Salvador 2009 6.137276e+06 99.395479
Azerbaijan 2005 8.391850e+05 99.320980
Greece 2001 1.862132e+06 99.179255
Costa Rica 2012 4.654122e+06 99.144640
Central African Republic 2010 4.448525e+06 99.140130
Poland 2008 3.812576e+07 99.000417
Croatia 2008 4.434580e+05 98.967989
Mauritius 2011 1.252440e+05 98.875598
Costa Rica 2014 4.757575e+06 98.863038
Poland 2004 3.818222e+07 98.834026
Papua New Guinea 2004 6.161517e+06 98.823683
Belgium 2015 1.127420e+07 98.809627
Estonia 2009 1.334515e+06 98.746992
Nicaragua 2005 5.379328e+06 98.667019
Greece 2010 1.112134e+07 98.549227
Ukraine 2014 4.527195e+07 98.521532
Italy 2003 5.731323e+06 98.507318
Russian Federation 2003 1.446483e+08 98.507209
Serbia 2010 7.291436e+06 98.491533
Iraq 2006 2.769791e+07 98.480336
Myanmar 2005 4.848261e+07 98.476615
Guinea 2012 1.128147e+07 98.381317
Japan 2011 1.278330e+05 98.326340
Portugal 2006 1.522288e+06 98.281810
Russian Federation 2005 1.435185e+08 98.200364
Pakistan 2011 1.741843e+08 98.183493
Albania 2008 2.947314e+06 98.179392
Bosnia and Herzegovina 2015 3.535961e+06 98.152067
Ukraine 2003 4.781295e+06 98.145568
Zimbabwe 2014 1.541168e+07 98.138502
Bulgaria 2007 7.545338e+06 98.121647
Swaziland 2003 1.873920e+05 97.992076
Burkina Faso 2012 1.657122e+07 97.524418
Albania 2013 2.895920e+05 97.467188
Honduras 2010 8.194778e+06 97.116378
Netherlands 2002 1.614893e+07 97.099412
Malawi 2013 1.657715e+07 96.664872
Ethiopia 2012 9.244418e+07 96.643092
Latvia 2008 2.177322e+06 96.528421
Syrian Arab Republic 2003 1.741527e+07 96.405720
Armenia 2009 2.888584e+06 95.860841
Ecuador 2003 1.328961e+06 95.820705
Albania 2003 3.396160e+05 95.729137
Kazakhstan 2010 1.632158e+07 95.422784
Mexico 2013 1.225360e+08 94.519516
India 2004 1.126136e+09 94.210538
Trinidad and Tobago 2008 1.315372e+06 93.454402
Mexico 2007 1.118363e+08 92.792695
El Salvador 2005 6.289610e+05 91.835572
Sweden 2007 9.148920e+05 91.835312
Georgia 2009 3.978000e+03 91.511628
Djibouti 2008 8.229340e+05 91.030195
Mauritania 2015 4.182341e+06 89.152203
Costa Rica 2003 4.125971e+06 88.067676
Benin 2014 1.286712e+06 88.039651
Liberia 2012 4.181563e+06 87.654419
Uganda 2008 3.166390e+07 87.080782
Ireland 2005 4.159914e+06 87.018154
Bulgaria 2002 7.837161e+06 86.917716
Cabo Verde 2012 5.139790e+05 86.605079
Suriname 2008 5.151480e+05 85.217238
Argentina 2010 4.122389e+07 84.892586
Spain 2002 4.143156e+07 84.353386
Colombia 2002 4.157249e+07 82.328471
Syrian Arab Republic 2013 1.989141e+06 80.955461
Vanuatu 2006 2.146340e+05 72.079333
Chad 2008 1.113386e+07 61.698425
Mali 2001 1.129326e+07 56.393482
Greece 2013 1.965211e+06 16.161766
Rwanda 2012 1.788853e+06 10.794298
Eritrea 2006 4.666480e+05 10.755246
Panama 2001 3.896840e+05 10.685729
Malaysia 2015 3.723155e+06 10.533330
Bhutan 2002 6.639900e+04 10.261703
Cyprus 2007 1.637120e+05 10.244728
Haiti 2013 1.431776e+06 10.105840
Dominican Republic 2013 1.281296e+06 10.093952
Belgium 2008 1.799730e+05 10.070493
Guinea 2015 1.291533e+06 9.893589
Mexico 2004 1.699558e+07 9.863594
Slovenia 2006 2.686800e+04 9.860146
Kazakhstan 2009 1.692710e+05 9.799477
Myanmar 2012 5.986514e+06 9.780083
Brazil 2014 2.421313e+07 9.767939
Lesotho 2011 2.641660e+05 9.759888
Mexico 2012 1.282837e+06 9.697708
Lebanon 2007 4.864660e+05 9.636624
Tunisia 2006 1.196136e+06 9.634021
Jordan 2011 7.574943e+06 9.546549
Mexico 2002 1.435568e+06 9.496373
Dominican Republic 2015 1.528394e+06 9.479649
Equatorial Guinea 2003 6.946110e+05 9.422240
Lebanon 2004 3.863267e+06 9.400111
Lebanon 2015 5.851479e+06 9.388243
Niger 2007 1.466834e+07 9.379050
Montenegro 2001 6.738900e+04 9.375520
Niger 2005 1.361845e+07 9.374285
Tonga 2010 1.413700e+04 9.364370
Angola 2009 2.254955e+07 9.363120
Indonesia 2003 2.254521e+07 9.361523
South Sudan 2004 7.787655e+06 9.360857
Belgium 2002 1.332785e+06 9.359211
Kenya 2007 3.885990e+05 9.355736
Uganda 2004 2.756844e+07 9.354412
Gabon 2012 1.756817e+06 9.351816
Liberia 2006 3.375838e+06 9.351426
Angola 2004 1.886572e+07 9.346625
South Sudan 2006 8.468152e+06 9.341177
Mali 2008 1.413822e+07 9.338233
Chad 2011 1.228865e+07 9.337700
Gabon 2011 1.697110e+05 9.334998
Iraq 2012 3.277657e+07 9.330641
Tunisia 2009 1.521834e+06 9.329003
Mali 2005 1.279876e+07 9.328280
Uganda 2013 3.755373e+07 9.326047
Angola 2001 1.698327e+07 9.324651
Seychelles 2002 8.372300e+04 9.308175
Eritrea 2001 3.497124e+06 9.307456
Burkina Faso 2008 1.468973e+07 9.306981
Madagascar 2002 1.676512e+07 9.304744
Malawi 2008 1.427123e+07 9.304371
Burkina Faso 2006 1.382918e+07 9.303419
Malawi 2011 1.562762e+07 9.303052
Chad 2013 1.313359e+07 9.299764
Jordan 2015 9.159320e+05 9.298777
South Sudan 2015 1.188214e+07 9.296737
Malta 2007 4.672400e+04 9.296166
Mozambique 2012 2.567666e+06 9.295579
Mozambique 2013 2.643437e+07 9.295098
Madagascar 2007 1.943352e+07 9.291719
Timor-Leste 2004 9.966980e+05 9.290939
Mauritania 2006 3.226530e+05 9.284744
Benin 2010 9.199259e+06 9.284523
Senegal 2010 1.291623e+07 9.284301
Madagascar 2005 1.833672e+07 9.284215
Burundi 2014 9.891790e+05 9.284023
Madagascar 2011 2.174395e+07 9.280030
Afghanistan 2007 2.661679e+07 9.279353
Kenya 2003 3.413852e+06 9.278074
Benin 2012 9.729160e+05 9.275617
Kenya 2001 3.232148e+07 9.275523
Kenya 2011 4.248684e+07 9.274553
Togo 2002 5.251472e+06 9.273295
Mauritania 2002 2.873228e+06 9.271470
Nigeria 2012 1.672973e+08 9.271340
Cameroon 2006 1.789956e+07 9.270607
Solomon Islands 2001 4.238530e+05 9.270494
Nigeria 2010 1.585783e+08 9.269162
Cameroon 2015 2.283452e+07 9.267349
Mauritania 2007 3.312665e+06 9.266959
Togo 2004 5.534598e+06 9.265585
Gabon 2003 1.328146e+06 9.259994
Ghana 2002 1.992452e+07 9.258923
Guinea-Bissau 2011 1.596154e+06 9.258850
Nigeria 2003 1.319725e+08 9.256929
Ghana 2008 2.329864e+06 9.254142
Honduras 2002 6.863157e+06 9.253297
Iraq 2008 2.911142e+07 9.252546
Ghana 2011 2.512180e+07 9.248716
Vanuatu 2002 1.939560e+05 9.246500
Honduras 2012 8.556460e+05 9.245294
Tonga 2006 1.168900e+04 9.244522
Comoros 2008 6.572290e+05 9.243275
Philippines 2003 8.331954e+06 9.241780
Papua New Guinea 2007 6.627922e+06 9.239779
Syrian Arab Republic 2005 1.829461e+07 9.239685
Timor-Leste 2001 8.925310e+05 9.239322
Timor-Leste 2013 1.184366e+06 9.238649
Luxembourg 2014 5.563190e+05 9.238497
Sudan 2013 3.684992e+07 9.238386
Ghana 2013 2.634625e+07 9.238118
Sao Tome and Principe 2006 1.593280e+05 9.237615
Sierra Leone 2011 6.611692e+06 9.236846
Belize 2011 3.291920e+05 9.233524
Sao Tome and Principe 2008 1.669130e+05 9.233156
Liberia 2015 4.499621e+06 9.232528
Sierra Leone 2010 6.458720e+05 9.231474
Belize 2005 2.832770e+05 9.230669
Guatemala 2001 1.192495e+07 9.229481
Togo 2011 6.679282e+06 9.229361
Tajikistan 2011 7.815949e+06 9.228118
Burkina Faso 2001 1.194459e+07 9.227038
Tajikistan 2014 8.362745e+06 9.226042
Solomon Islands 2011 5.396140e+05 9.224028
Belize 2013 3.441810e+05 9.221882
Greece 2007 1.148473e+06 9.221187
Kiribati 2007 9.631100e+04 9.217590
Sudan 2010 3.438596e+07 9.216832
Guatemala 2011 1.494892e+07 9.215078
Bhutan 2005 6.566390e+05 9.214975
Pakistan 2002 1.446541e+08 9.214651
Sao Tome and Principe 2001 1.416220e+05 9.213616
Guatemala 2013 1.559621e+07 9.212587
Syrian Arab Republic 2001 1.676690e+07 9.212211
Malaysia 2002 2.419881e+07 9.210913
Papua New Guinea 2013 7.592865e+06 9.207714
Solomon Islands 2015 5.874820e+05 9.207492
Sao Tome and Principe 2014 1.912660e+05 9.203574
Chad 2003 9.353210e+05 9.201016
Rwanda 2009 9.977446e+06 9.200125
Afghanistan 2004 2.411898e+07 9.198942
Guinea 2005 9.679745e+06 9.197481
Rwanda 2007 9.447420e+05 9.196011
Namibia 2011 2.215621e+06 9.195341
Guinea-Bissau 2009 1.517448e+06 9.195094
Maldives 2010 3.670000e+02 9.194444
Malaysia 2005 2.565939e+07 9.192738
Namibia 2015 2.425561e+06 9.191775
Mongolia 2014 2.923896e+06 9.190738
Seychelles 2008 8.695600e+04 9.190554
Uganda 2001 2.485489e+07 9.189463
Botswana 2014 2.168573e+06 9.187934
Luxembourg 2009 4.977830e+05 9.186903
Algeria 2010 3.611764e+07 9.183805
Seychelles 2013 8.994900e+04 9.183290
Cabo Verde 2001 4.437160e+05 9.181877
Panama 2008 3.516268e+06 9.180661
Turkmenistan 2012 5.267839e+06 9.180166
Mongolia 2011 2.761516e+06 9.180141
Tajikistan 2001 6.327125e+06 9.178363
Guinea-Bissau 2006 1.412669e+06 9.174945
Lebanon 2009 4.183156e+06 9.174356
Nepal 2001 2.416178e+07 9.173761
Eritrea 2011 4.474690e+05 9.173449
Zimbabwe 2008 1.355847e+07 9.171402
Djibouti 2010 8.511460e+05 9.170953
Cambodia 2002 1.263473e+07 9.169017
Namibia 2010 2.173170e+05 9.167353
Philippines 2012 9.686664e+07 9.166744
Malaysia 2010 2.811229e+07 9.165785
Kiribati 2001 8.585800e+04 9.165522
Malawi 2003 1.233669e+07 9.164435
Philippines 2004 8.467849e+07 9.163101
Ecuador 2011 1.517736e+07 9.162484
Philippines 2010 9.372662e+07 9.162404
Uzbekistan 2009 2.776740e+05 9.160787
Rwanda 2004 8.818438e+06 9.155443
Botswana 2001 1.754935e+06 9.153876
Mexico 2010 1.173189e+08 9.152888
Turkey 2001 6.419147e+07 9.150260
Haiti 2010 9.999617e+06 9.148938
Cyprus 2002 9.769660e+05 9.146923
Turkey 2012 7.456987e+07 9.146312
Botswana 2005 1.855852e+06 9.144982
Costa Rica 2005 4.247841e+06 9.144389
South Africa 2013 5.331196e+07 9.142071
Mongolia 2008 2.628131e+06 9.140685
Afghanistan 2010 2.883167e+06 9.140178
Algeria 2005 3.328844e+07 9.139034
Australia 2015 2.378934e+07 9.137384
Indonesia 2007 2.329891e+08 9.137093
Iceland 2005 2.967340e+05 9.136435
Dominican Republic 2009 9.767758e+06 9.136188
Ecuador 2008 1.444756e+07 9.135418
Luxembourg 2005 4.651580e+05 9.135265
Mexico 2015 1.258995e+07 9.135072
South Africa 2007 4.888384e+07 9.134760
Paraguay 2014 6.552584e+06 9.134314
Cabo Verde 2004 4.676640e+05 9.134223
Ecuador 2015 1.614437e+07 9.133856
Turkmenistan 2010 5.872100e+04 9.133046
Azerbaijan 2012 9.295784e+06 9.132948
Indonesia 2010 2.425241e+08 9.132835
Costa Rica 2009 4.488263e+06 9.132480
Iceland 2001 2.849680e+05 9.132196
Turkmenistan 2008 4.935762e+06 9.132185
Israel 2009 7.485600e+04 9.132106
Algeria 2001 3.159215e+07 9.130996
Nicaragua 2009 5.666581e+06 9.128734
Indonesia 2012 2.488832e+08 9.126462
Turkmenistan 2007 4.871370e+05 9.124431
Australia 2002 1.965140e+05 9.122804
Turkey 2006 6.876345e+06 9.122007
Peru 2003 2.693774e+07 9.121387
Turkey 2007 6.959728e+07 9.121261
Papua New Guinea 2011 7.269348e+06 9.121071
Nepal 2014 2.832324e+07 9.120753
India 2014 1.293859e+09 9.119642
Honduras 2008 7.872658e+06 9.119462
Azerbaijan 2015 9.649341e+06 9.119079
South Africa 2006 4.823384e+06 9.118976
Honduras 2013 8.657785e+06 9.118419
Bangladesh 2013 1.575713e+08 9.118397
Swaziland 2007 1.138434e+06 9.118154
Dominican Republic 2005 9.237566e+06 9.117838
Nicaragua 2013 5.945747e+06 9.116666
Tunisia 2015 1.127366e+07 9.116369
Bangladesh 2011 1.539119e+08 9.115860
Canada 2013 3.515545e+07 9.115090
Switzerland 2011 7.912398e+06 9.111704
Morocco 2011 3.285882e+07 9.111530
Colombia 2009 4.541618e+07 9.111485
Indonesia 2014 2.551311e+08 9.111305
Cyprus 2011 1.124835e+06 9.109332
Suriname 2005 4.989460e+05 9.107692
Chile 2006 1.631979e+07 9.106611
Sweden 2015 9.799186e+06 9.106307
Suriname 2011 5.315890e+05 9.103758
Tonga 2009 1.364000e+03 9.103704
Nepal 2007 2.621485e+07 9.103548
Haiti 2009 9.852870e+05 9.102503
Fiji 2009 8.519670e+05 9.102296
Tajikistan 2009 7.472819e+06 9.102117
Ecuador 2005 1.373523e+07 9.102058
Suriname 2003 4.883320e+05 9.101191
Brazil 2009 1.948960e+08 9.099322
Suriname 2014 5.479280e+05 9.099311
Afghanistan 2001 2.966463e+06 9.098391
Guinea 2001 8.971139e+06 9.096426
Gabon 2006 1.444844e+06 9.094909
Thailand 2001 6.354332e+07 9.092937
Kazakhstan 2004 1.512985e+06 9.092084
Colombia 2012 4.688148e+07 9.089315
Spain 2009 4.636295e+07 9.088955
Rwanda 2001 8.329460e+05 9.087389
Argentina 2004 3.872870e+07 9.087229
Seychelles 2007 8.533000e+03 9.086288
Canada 2006 3.257550e+05 9.081549
Tunisia 2002 9.864326e+06 9.080338
France 2010 6.527512e+06 9.077302
Sweden 2008 9.219637e+06 9.077295
Lesotho 2004 1.933728e+06 9.076906
Samoa 2015 1.937590e+05 9.076395
Samoa 2011 1.876650e+05 9.075973
Sri Lanka 2007 1.966800e+04 9.075820
Azerbaijan 2002 8.171950e+05 9.074896
Panama 2015 3.969249e+06 9.074594
Djibouti 2005 7.832540e+05 9.073747
Cyprus 2013 1.143896e+06 9.072876
Fiji 2015 8.921490e+05 9.070993
France 2005 6.317936e+07 9.068589
Jamaica 2002 2.695446e+06 9.068492
Malta 2002 3.959690e+05 9.068374
Norway 2005 4.623291e+06 9.068340
Italy 2008 5.882673e+07 9.066467
Fiji 2012 8.735960e+05 9.066094
Haiti 2007 9.556889e+06 9.065636
Italy 2004 5.768533e+07 9.064923
China 2004 1.296750e+05 9.064809
Guyana 2012 7.539100e+04 9.064210
Switzerland 2001 7.229854e+06 9.063478
France 2003 6.224488e+07 9.063411
Nigeria 2009 1.544218e+07 9.061764
Thailand 2006 6.582416e+07 9.060939
Guyana 2014 7.633930e+05 9.060397
Sweden 2014 9.696110e+05 9.060397
Mauritania 2011 3.717672e+06 9.060188
Israel 2005 6.931000e+03 9.059507
India 2011 1.247236e+08 9.059419
Slovenia 2015 2.635310e+05 9.059203
Samoa 2001 1.755660e+05 9.054751
France 2014 6.633196e+07 9.050514
Thailand 2009 6.688187e+07 9.050508
Netherlands 2010 1.661539e+07 9.049301
Burkina Faso 2005 1.342193e+06 9.048686
Niger 2012 1.773163e+07 9.048324
Finland 2008 5.313399e+06 9.046663
El Salvador 2012 6.221246e+06 9.046323
Jamaica 2007 2.775467e+06 9.045885
Afghanistan 2006 2.589345e+06 9.044085
Turkmenistan 2005 4.754641e+06 9.043644
Jamaica 2011 2.829493e+06 9.043600
Denmark 2010 5.547683e+06 9.042964
Trinidad and Tobago 2002 1.277837e+06 9.042888
Japan 2008 1.286300e+04 9.041374
Malta 2011 4.162680e+05 9.040716
Thailand 2014 6.841677e+07 9.040080
Chile 2013 1.746298e+07 9.037662
Ireland 2001 3.866243e+06 9.037653
El Salvador 2014 6.281189e+06 9.037424
Finland 2006 5.266268e+06 9.036798
Netherlands 2012 1.675496e+07 9.036674
Mauritius 2008 1.244121e+06 9.036229
Nepal 2011 2.732715e+07 9.035172
Netherlands 2004 1.628178e+07 9.034797
China 2003 1.288400e+04 9.034268
Austria 2011 8.391643e+06 9.033722
Ireland 2011 4.576794e+06 9.033418
Sweden 2002 8.924958e+06 9.032597
Jamaica 2015 2.871934e+06 9.031661
Norway 2001 4.513751e+06 9.031293
Thailand 2005 6.542547e+06 9.031027
Finland 2004 5.228172e+06 9.028835
Denmark 2012 5.591572e+06 9.028430
Mauritius 2012 1.255882e+06 9.027482
Uruguay 2007 3.339741e+06 9.024947
Malawi 2006 1.342926e+07 9.023999
Netherlands 2007 1.638170e+07 9.021770
Samoa 2007 1.822860e+05 9.019017
Sierra Leone 2008 6.165372e+06 9.018202
Russian Federation 2014 1.438197e+08 9.017452
Germany 2001 8.234992e+07 9.016828
Angola 2013 2.599834e+06 9.014190
Central African Republic 2003 3.981665e+06 9.013946
Uruguay 2009 3.362755e+06 9.013445
Trinidad and Tobago 2005 1.296934e+06 9.012228
Central African Republic 2013 4.499653e+06 9.012223
Bosnia and Herzegovina 2001 3.771284e+06 9.012010
France 2011 6.534278e+07 9.010365
Montenegro 2007 6.158750e+05 9.010158
Slovenia 2002 1.994530e+05 9.009686
Bosnia and Herzegovina 2003 3.779247e+06 9.008944
Slovenia 2003 1.995733e+06 9.006031
Montenegro 2015 6.221590e+05 9.005613
Bosnia and Herzegovina 2013 3.649990e+05 9.004907
Russian Federation 2009 1.427853e+08 9.003012
Uruguay 2005 3.325612e+06 9.001961
Uruguay 2002 3.327773e+06 9.001933
Nicaragua 2012 5.877180e+05 8.998265
Croatia 2004 4.439000e+03 8.997748
Bosnia and Herzegovina 2006 3.779468e+06 8.994547
Germany 2005 8.246942e+07 8.994324
Paraguay 2011 6.293783e+06 8.992083
Russian Federation 2008 1.427424e+07 8.991849
Spain 2015 4.644770e+07 8.991154
Croatia 2007 4.436000e+03 8.990991
Panama 2012 3.772938e+06 8.987077
Guyana 2001 7.522630e+05 8.986101
Germany 2010 8.177693e+06 8.982085
Belgium 2009 1.796493e+06 8.982014
Tonga 2001 9.861100e+04 8.978850
Guyana 2007 7.478690e+05 8.976775
Belarus 2011 9.473172e+06 8.976139
Swaziland 2011 1.225258e+06 8.974178
Georgia 2015 3.717100e+04 8.973437
Croatia 2010 4.417781e+06 8.972913
Thailand 2003 6.455495e+07 8.972705
Hungary 2014 9.866468e+06 8.972354
Spain 2014 4.648882e+06 8.970899
Serbia 2004 7.463157e+06 8.969606
Cabo Verde 2007 4.864380e+05 8.969013
Poland 2015 3.798641e+07 8.965649
Kazakhstan 2014 1.728922e+07 8.963391
Ukraine 2012 4.559330e+05 8.963353
Costa Rica 2007 4.369469e+06 8.957905
Belarus 2008 9.527985e+06 8.956586
Croatia 2012 4.267558e+06 8.956460
Mozambique 2001 1.858876e+07 8.952823
Denmark 2005 5.419432e+06 8.952623
Serbia 2006 7.411569e+06 8.951500
Norway 2008 4.768212e+06 8.951335
Serbia 2013 7.164132e+06 8.950501
Serbia 2002 7.496522e+06 8.949819
Belarus 2001 9.928549e+06 8.948835
Canada 2011 3.434278e+06 8.946529
Estonia 2005 1.354775e+06 8.942938
Vanuatu 2010 2.362950e+05 8.934623
Panama 2004 3.269541e+06 8.932562
Belarus 2005 9.663915e+06 8.930591
Myanmar 2001 4.662799e+07 8.930438
Romania 2003 2.157433e+07 8.926094
Bulgaria 2005 7.658972e+06 8.924985
Cameroon 2013 2.165572e+07 8.922967
Montenegro 2013 6.212700e+04 8.922856
Ukraine 2006 4.678775e+06 8.922855
Poland 2014 3.811735e+06 8.921329
Burkina Faso 2014 1.758598e+07 8.920319
Indonesia 2004 2.236146e+08 8.918498
Lithuania 2003 3.415213e+06 8.917364
Armenia 2007 2.933560e+05 8.915701
Romania 2015 1.981548e+07 8.912801
Finland 2003 5.213140e+05 8.911289
Bosnia and Herzegovina 2011 3.688865e+06 8.908739
Mali 2004 1.239196e+06 8.903427
Portugal 2011 1.557560e+05 8.901214
Argentina 2013 4.253992e+07 8.900514
Latvia 2003 2.287955e+06 8.897155
Iceland 2013 3.237640e+05 8.896198
Estonia 2004 1.362550e+05 8.893625
Jamaica 2010 2.817210e+05 8.891194
Zambia 2006 1.238345e+07 8.889699
Italy 2010 5.927742e+07 8.887207
Lithuania 2007 3.231294e+06 8.881663
Burundi 2012 9.319710e+05 8.876969
Bulgaria 2013 7.265115e+06 8.872583
Georgia 2006 4.136000e+03 8.871122
Mali 2011 1.554989e+06 8.867621
Malta 2001 3.932800e+04 8.864058
Georgia 2004 4.245000e+03 8.849188
Armenia 2015 2.916950e+05 8.847242
Bangladesh 2010 1.521491e+07 8.844794
Lithuania 2005 3.322528e+06 8.836512
Sao Tome and Principe 2010 1.747760e+05 8.811711
Cameroon 2003 1.651382e+07 8.801151
Guatemala 2009 1.431628e+06 8.781151
Kenya 2010 4.135152e+06 8.759070
Hungary 2003 1.129552e+06 8.748610
Seychelles 2011 8.744100e+04 8.740559
Rwanda 2014 1.134536e+07 8.737242
Hungary 2001 1.187576e+06 8.736544
Zambia 2009 1.345642e+07 8.733274
Togo 2014 7.228915e+06 8.730042
Portugal 2012 1.514844e+06 8.725751
Sri Lanka 2004 1.922800e+04 8.696420
Belgium 2012 1.112825e+07 8.695756
Zimbabwe 2011 1.438665e+07 8.679395
Ethiopia 2003 7.254514e+07 8.676309
Kiribati 2005 9.232500e+04 8.675645
Cambodia 2010 1.438740e+05 8.637862
Albania 2006 2.992547e+06 8.607293
Portugal 2015 1.358760e+05 8.594408
Tunisia 2005 1.124820e+05 8.563983
Israel 2014 8.215700e+04 8.558697
Azerbaijan 2001 8.111200e+04 8.558331
Paraguay 2009 6.127837e+06 8.469442
Timor-Leste 2011 1.131523e+06 8.461607
Comoros 2012 7.238680e+05 8.453800
Benin 2002 7.295394e+06 8.392409
Bhutan 2003 6.234340e+05 8.389208
Latvia 2014 1.993782e+06 8.376018
Armenia 2002 3.338970e+05 8.364661
Iceland 2007 3.115660e+05 8.222841
Austria 2003 8.121423e+06 8.208412
Switzerland 2014 8.188649e+06 8.207495
Maldives 2004 3.120000e+02 8.176471
Lithuania 2012 2.987773e+06 8.105871
Luxembourg 2011 5.183470e+05 8.101312
Central African Republic 2005 4.127910e+05 8.065157
Dominican Republic 2012 1.154950e+05 8.026573
Latvia 2012 2.343190e+05 8.019554
Eritrea 2007 4.153332e+06 7.900353
France 2001 6.135743e+07 7.876304
Hungary 2008 1.381880e+05 7.870715
Norway 2014 5.137232e+06 7.863058
Turkmenistan 2011 5.174610e+05 7.812197
Montenegro 2003 6.122670e+05 7.768216
Brazil 2013 2.248632e+06 7.750120
Belize 2009 3.139290e+05 7.680465
South Sudan 2011 1.448857e+06 7.665827
South Africa 2011 5.172935e+07 7.651213
Cameroon 2012 2.182383e+06 7.644916
Myanmar 2013 5.144820e+07 7.594016
Liberia 2003 3.116233e+06 7.587905
Afghanistan 2013 3.173169e+07 7.583189
Iraq 2011 3.172753e+06 7.432095
Malta 2009 4.124770e+05 7.353288
Uzbekistan 2015 3.129890e+05 7.329271
Slovenia 2007 2.181220e+05 7.118282
Panama 2002 3.149265e+06 7.081587
Morocco 2007 3.122588e+07 7.070067
Kiribati 2010 1.265200e+04 7.068878
Namibia 2005 2.321960e+05 6.944300
Peru 2015 3.137667e+07 6.896772
Tonga 2005 1.141000e+03 6.815068
Australia 2008 2.124920e+05 6.514924
Afghanistan 2002 2.197992e+07 6.409471
Lesotho 2013 2.117361e+06 6.303058
Guinea 2008 1.323142e+06 5.725777
Timor-Leste 2010 1.195910e+05 5.221893
Equatorial Guinea 2014 1.129424e+06 5.146659
Swaziland 2005 1.158730e+05 4.926098
Senegal 2005 1.125127e+07 4.752346
Cyprus 2010 1.112670e+05 4.598058
South Africa 2004 4.717990e+03 -0.999898
Iraq 2014 3.568000e+03 -0.999895
Ukraine 2002 4.822500e+04 -0.999009
Bosnia and Herzegovina 2007 3.774000e+03 -0.999001
Mauritius 2010 1.254000e+03 -0.998995
Myanmar 2009 4.986900e+04 -0.998992
Kazakhstan 2008 1.567400e+04 -0.998988
Algeria 2007 3.437600e+04 -0.998982
Mexico 2011 1.199170e+05 -0.998978
Uruguay 2013 3.485000e+03 -0.998974
Cameroon 2008 1.897800e+04 -0.998968
Ethiopia 2007 8.149000e+03 -0.998967
Senegal 2011 1.339100e+04 -0.998963
Niger 2002 1.226200e+04 -0.998958
Ecuador 2002 1.372600e+04 -0.998932
Turkey 2008 7.443200e+04 -0.998931
India 2015 1.395398e+06 -0.998922
Tajikistan 2006 7.557000e+03 -0.998897
El Salvador 2004 6.775000e+03 -0.998865
Cyprus 2004 1.141000e+03 -0.998852
Bhutan 2008 7.950000e+02 -0.998843
Turkmenistan 2009 5.795000e+03 -0.998826
Tunisia 2004 1.176100e+04 -0.998817
Dominican Republic 2011 1.279500e+04 -0.998707
Brazil 2012 2.569830e+05 -0.998707
Tonga 2004 1.460000e+02 -0.998537
Syrian Arab Republic 2012 2.427100e+04 -0.991525
Syrian Arab Republic 2014 1.923900e+04 -0.990328
Portugal 2014 1.416200e+04 -0.990282
Nicaragua 2001 5.175000e+03 -0.990176
Bosnia and Herzegovina 2012 3.648200e+04 -0.990110
Armenia 2006 2.958500e+04 -0.990076
Albania 2007 2.971700e+04 -0.990070
Bulgaria 2006 7.612200e+04 -0.990061
Albania 2012 2.941000e+03 -0.990037
Germany 2008 8.211970e+05 -0.990018
Croatia 2015 4.236400e+04 -0.990005
Poland 2007 3.812560e+05 -0.990004
Russian Federation 2004 1.446754e+06 -0.989998
Slovenia 2001 1.992600e+04 -0.989982
Japan 2007 1.281000e+03 -0.989981
Hungary 2010 1.230000e+02 -0.989971
Portugal 2010 1.573100e+04 -0.989969
Guyana 2011 7.491000e+03 -0.989966
Thailand 2015 6.865760e+05 -0.989965
Trinidad and Tobago 2010 1.328100e+04 -0.989951
Lithuania 2015 2.949100e+04 -0.989943
Thailand 2010 6.728880e+05 -0.989939
Latvia 2007 2.232500e+04 -0.989936
Central African Republic 2015 4.546100e+04 -0.989932
Switzerland 2003 7.339100e+04 -0.989925
Russian Federation 2015 1.449687e+06 -0.989920
Kazakhstan 2003 1.499180e+05 -0.989911
China 2002 1.284000e+03 -0.989904
Belgium 2014 1.129570e+05 -0.989899
Thailand 2004 6.522310e+05 -0.989896
Turkmenistan 2001 4.564800e+04 -0.989892
Trinidad and Tobago 2015 1.369200e+04 -0.989891
Myanmar 2002 4.714220e+05 -0.989890
Uzbekistan 2005 2.616700e+04 -0.989883
Turkmenistan 2006 4.811500e+04 -0.989880
Sweden 2013 9.637900e+04 -0.989875
Peru 2005 2.761410e+05 -0.989875
Zimbabwe 2005 1.294320e+05 -0.989870
Montenegro 2011 6.279000e+03 -0.989863
Mexico 2014 1.242216e+06 -0.989862
Brazil 2001 1.777567e+06 -0.989859
Uzbekistan 2007 2.686800e+04 -0.989857
Kazakhstan 2011 1.655660e+05 -0.989856
Cabo Verde 2013 5.216000e+03 -0.989852
Zimbabwe 2002 1.255250e+05 -0.989849
Cambodia 2008 1.388590e+05 -0.989847
Peru 2009 2.915700e+04 -0.989820
Honduras 2011 8.351600e+04 -0.989809
Bangladesh 2001 1.341716e+06 -0.989803
Haiti 2008 9.752900e+04 -0.989795
Turkey 2014 7.736280e+05 -0.989792
Seychelles 2006 8.460000e+02 -0.989790
Philippines 2002 8.135260e+05 -0.989788
Pakistan 2003 1.477341e+06 -0.989787
Central African Republic 2009 4.442300e+04 -0.989777
Sao Tome and Principe 2011 1.788000e+03 -0.989770
Myanmar 2004 4.873770e+05 -0.989766
Costa Rica 2013 4.764100e+04 -0.989764
Sierra Leone 2009 6.312600e+04 -0.989761
Bhutan 2011 7.451000e+03 -0.989760
Comoros 2013 7.415000e+03 -0.989756
Nicaragua 2011 5.878200e+04 -0.989755
Canada 2010 3.452740e+05 -0.989733
Ethiopia 2010 8.772670e+05 -0.989730
Ghana 2007 2.272120e+05 -0.989725
Kenya 2002 3.321490e+05 -0.989724
Guinea 2014 1.185590e+05 -0.989723
Bhutan 2001 5.896000e+03 -0.989718
Kenya 2006 3.752500e+04 -0.989714
Eritrea 2005 3.969700e+04 -0.989712
Burkina Faso 2002 1.229310e+05 -0.989708
Benin 2011 9.468200e+04 -0.989708
Zambia 2010 1.385330e+05 -0.989705
Israel 2004 6.890000e+02 -0.989701
Benin 2004 7.754000e+03 -0.989696
Rwanda 2006 9.265800e+04 -0.989695
Kenya 2014 4.624250e+05 -0.989684
Greece 2002 1.922200e+04 -0.989677
Afghanistan 2014 3.275820e+05 -0.989677
Portugal 2005 1.533300e+04 -0.989667
Netherlands 2001 1.646180e+05 -0.989663
Gabon 2010 1.642100e+04 -0.989651
Mozambique 2015 2.816910e+05 -0.989648
South Sudan 2007 8.856800e+04 -0.989541
Belgium 2007 1.625700e+04 -0.989498
Papua New Guinea 2003 6.172400e+04 -0.989471
Burkina Faso 2004 1.335690e+05 -0.989445
Jordan 2014 8.893600e+04 -0.989429
Sweden 2006 9.855000e+03 -0.989398
Honduras 2009 8.352100e+04 -0.989391
Syrian Arab Republic 2002 1.787910e+05 -0.989337
Uganda 2015 4.144870e+05 -0.989327
Pakistan 2004 1.578300e+04 -0.989317
Afghanistan 2005 2.577980e+05 -0.989311
Mali 2012 1.666700e+04 -0.989282
Angola 2012 2.596150e+05 -0.989280
Mali 2003 1.251280e+05 -0.989249
Kenya 2009 4.237240e+05 -0.989176
Malawi 2012 1.697350e+05 -0.989139
Suriname 2006 5.437000e+03 -0.989103
Guatemala 2005 1.396280e+05 -0.989089
Tonga 2008 1.350000e+02 -0.989075
Swaziland 2002 1.893000e+03 -0.989053
Chad 2015 1.494130e+05 -0.988989
Morocco 2005 3.521700e+04 -0.988923
Philippines 2014 1.122490e+05 -0.988602
Central African Republic 2004 4.553600e+04 -0.988564
Haiti 2011 1.145540e+05 -0.988544
Lebanon 2006 4.573500e+04 -0.988529
Costa Rica 2002 4.632400e+04 -0.988410
Mexico 2001 1.367680e+05 -0.988330
Mozambique 2004 2.312750e+05 -0.988270
Ireland 2004 4.726200e+04 -0.988174
Liberia 2011 4.716700e+04 -0.988053
Senegal 2003 1.679900e+04 -0.987974
Romania 2012 2.583500e+04 -0.987970
Slovenia 2005 2.474000e+03 -0.987612
Iraq 2010 3.762710e+05 -0.987413
Namibia 2004 2.922800e+04 -0.985287
Benin 2013 1.445100e+04 -0.985147
Chad 2005 1.679000e+03 -0.982716
Guinea 2007 1.967270e+05 -0.980091
Greece 2006 1.123620e+05 -0.943460
Zambia 2002 1.112490e+05 -0.939012
Mexico 2006 1.192378e+06 -0.935450
Ghana 2005 2.154290e+05 -0.927867
Sudan 2006 3.167640e+05 -0.919026
Latvia 2011 2.597900e+04 -0.912692
Hungary 2009 1.226500e+04 -0.911244
Georgia 2008 4.300000e+01 -0.910788
El Salvador 2008 6.113100e+04 -0.910559
Mauritania 2005 3.137200e+04 -0.908489
Greece 2009 1.117170e+05 -0.905151
Armenia 2003 3.178600e+04 -0.904803
Armenia 2001 3.565500e+04 -0.903528
Greece 2015 1.828830e+05 -0.903360
Croatia 2011 4.286220e+05 -0.902978
Albania 2002 3.511000e+03 -0.902939
Cambodia 2005 1.327210e+05 -0.902653
Hungary 2002 1.158680e+05 -0.902433
Bosnia and Herzegovina 2014 3.566200e+04 -0.902296
Lithuania 2006 3.269990e+05 -0.901581
Romania 2001 2.213197e+06 -0.901386
Georgia 2005 4.190000e+02 -0.901296
Namibia 2009 2.137400e+04 -0.901218
Lithuania 2004 3.377750e+05 -0.901097
Georgia 2003 4.310000e+02 -0.901079
Serbia 2011 7.234990e+05 -0.900774
Ukraine 2004 4.745160e+05 -0.900756
Bulgaria 2004 7.716860e+05 -0.900752
Georgia 2002 4.357000e+03 -0.900670
Belarus 2004 9.731460e+05 -0.900666
Bosnia and Herzegovina 2010 3.722840e+05 -0.900633
Estonia 2002 1.379350e+05 -0.900631
Lithuania 2001 3.478180e+05 -0.900610
Romania 2006 2.119376e+06 -0.900591
Estonia 2006 1.346810e+05 -0.900588
Maldives 2009 3.600000e+01 -0.900552
Poland 2013 3.841960e+05 -0.900549
Albania 2010 2.913210e+05 -0.900489
Hungary 2012 9.923620e+05 -0.900482
Russian Federation 2007 1.428588e+06 -0.900444
Russian Federation 2001 1.459768e+07 -0.900423
Russian Federation 2002 1.453646e+06 -0.900419
Ukraine 2007 4.659350e+05 -0.900415
Serbia 2008 7.352220e+05 -0.900398
Serbia 2014 7.135760e+05 -0.900396
Serbia 2009 7.328700e+04 -0.900320
Spain 2013 4.662450e+05 -0.900319
El Salvador 2002 5.943300e+04 -0.900274
Ukraine 2015 4.515429e+06 -0.900260
Hungary 2015 9.843280e+05 -0.900235
Serbia 2005 7.447690e+05 -0.900207
Estonia 2003 1.377200e+04 -0.900156
Belarus 2006 9.649240e+05 -0.900152
Serbia 2003 7.485910e+05 -0.900142
Albania 2015 2.887300e+04 -0.900064
Croatia 2006 4.440000e+02 -0.900045
Iceland 2010 3.184100e+04 -0.900028
Poland 2001 3.824876e+06 -0.900026
Germany 2004 8.251626e+06 -0.900022
Uruguay 2004 3.324960e+05 -0.900020
Russian Federation 2006 1.434953e+07 -0.900016
Estonia 2007 1.346800e+04 -0.900001
Bosnia and Herzegovina 2005 3.781530e+05 -0.899994
Romania 2014 1.998979e+06 -0.899969
Poland 2003 3.824570e+05 -0.899968
Italy 2001 5.697410e+05 -0.899944
Spain 2012 4.677355e+06 -0.899934
Poland 2009 3.815163e+06 -0.899932
Slovenia 2004 1.997120e+05 -0.899931
Estonia 2015 1.315470e+05 -0.899930
Russian Federation 2011 1.429687e+07 -0.899917
Seychelles 2001 8.122000e+03 -0.899890
Bosnia and Herzegovina 2002 3.775870e+05 -0.899878
Netherlands 2014 1.686580e+05 -0.899872
Bulgaria 2012 7.358880e+05 -0.899856
Netherlands 2006 1.634611e+06 -0.899839
Montenegro 2006 6.152500e+04 -0.899839
Uruguay 2006 3.331430e+05 -0.899825
Uruguay 2001 3.327130e+05 -0.899823
Japan 2010 1.287000e+03 -0.899821
Serbia 2001 7.534330e+05 -0.899761
Austria 2010 8.363440e+05 -0.899759
Finland 2001 5.188800e+04 -0.899758
Greece 2011 1.114899e+06 -0.899751
Croatia 2001 4.440000e+02 -0.899684
Cambodia 2014 1.527790e+05 -0.899665
Trinidad and Tobago 2001 1.272380e+05 -0.899653
Finland 2005 5.246960e+05 -0.899641
Denmark 2003 5.395740e+05 -0.899632
Central African Republic 2012 4.494160e+05 -0.899598
Armenia 2013 2.893590e+05 -0.899595
Belarus 2009 9.567650e+05 -0.899584
Finland 2007 5.288720e+05 -0.899574
France 2015 6.662468e+06 -0.899559
Thailand 2013 6.814365e+06 -0.899558
El Salvador 2011 6.192560e+05 -0.899547
Mauritius 2007 1.239630e+05 -0.899543
Netherlands 2011 1.669374e+06 -0.899528
Netherlands 2003 1.622532e+06 -0.899527
France 2012 6.565979e+06 -0.899515
China 2009 1.331260e+05 -0.899501
Denmark 2011 5.575720e+05 -0.899495
Italy 2007 5.843831e+06 -0.899494
Malta 2010 4.145800e+04 -0.899490
Trinidad and Tobago 2003 1.284520e+05 -0.899477
Thailand 2008 6.654576e+06 -0.899471
Netherlands 2013 1.684432e+06 -0.899467
Netherlands 2009 1.653388e+06 -0.899463
Denmark 2009 5.523950e+05 -0.899448
Uruguay 2008 3.358240e+05 -0.899446
Jamaica 2008 2.791220e+05 -0.899432
Ireland 2010 4.561550e+05 -0.899423
El Salvador 2013 6.257770e+05 -0.899413
Norway 2004 4.591910e+05 -0.899407
Ukraine 2009 4.653300e+04 -0.899406
France 2009 6.477440e+05 -0.899380
Mauritius 2004 1.221300e+04 -0.899346
Jamaica 2006 2.762790e+05 -0.899340
Fiji 2014 8.858600e+04 -0.899301
Jamaica 2012 2.849920e+05 -0.899278
China 2001 1.271850e+05 -0.899271
Cyprus 2014 1.152390e+05 -0.899257
Sri Lanka 2006 1.952000e+03 -0.899241
Azerbaijan 2003 8.234100e+04 -0.899239
Jamaica 2001 2.677110e+05 -0.899238
Samoa 2010 1.862500e+04 -0.899230
Slovenia 2014 2.619800e+04 -0.899220
France 2002 6.185267e+06 -0.899193
France 2004 6.274897e+06 -0.899190
Mauritius 2014 1.269340e+05 -0.899151
France 2007 6.416229e+06 -0.899150
Canada 2015 3.584861e+06 -0.899145
Lesotho 2008 1.999930e+05 -0.899110
Tunisia 2001 9.785710e+05 -0.899108
Fiji 2010 8.599500e+04 -0.899063
Norway 2006 4.666770e+05 -0.899060
Cyprus 2012 1.135620e+05 -0.899041
Suriname 2015 5.532800e+04 -0.899023
Myanmar 2015 5.243669e+06 -0.899013
Norway 2015 5.188670e+05 -0.898999
Nepal 2009 2.674113e+06 -0.898998
Seychelles 2012 8.833000e+03 -0.898983
Fiji 2008 8.433400e+04 -0.898978
Turkmenistan 2002 4.612000e+03 -0.898966
Guyana 2005 7.594600e+04 -0.898961
Switzerland 2010 7.824990e+05 -0.898952
Argentina 2006 3.955889e+06 -0.898944
Suriname 2004 4.936300e+04 -0.898915
Chile 2005 1.614764e+06 -0.898912
Italy 2002 5.759700e+04 -0.898907
Colombia 2010 4.591897e+06 -0.898893
Iceland 2004 2.927400e+04 -0.898888
Samoa 2006 1.819400e+04 -0.898882
Azerbaijan 2007 8.581300e+04 -0.898860
Bangladesh 2015 1.612886e+06 -0.898849
Suriname 2012 5.377700e+04 -0.898837
Bangladesh 2008 1.488581e+07 -0.898832
Bangladesh 2012 1.557275e+07 -0.898820
Brazil 2006 1.891241e+07 -0.898819
Bangladesh 2014 1.594528e+07 -0.898806
Nepal 2013 2.798531e+06 -0.898787
Colombia 2008 4.491544e+06 -0.898781
Luxembourg 2003 4.516300e+04 -0.898777
Peru 2008 2.864198e+06 -0.898766
Mongolia 2006 2.558120e+05 -0.898746
Costa Rica 2010 4.545280e+05 -0.898730
Algeria 2002 3.199546e+06 -0.898723
Azerbaijan 2013 9.416810e+05 -0.898698
Norway 2011 4.953880e+05 -0.898678
Indonesia 2011 2.457751e+07 -0.898660
Peru 2002 2.661467e+06 -0.898655
Indonesia 2009 2.393448e+07 -0.898651
Bangladesh 2006 1.453684e+06 -0.898649
Nicaragua 2006 5.452110e+05 -0.898647
Paraguay 2013 6.465740e+05 -0.898644
Costa Rica 2008 4.429580e+05 -0.898624
Central African Republic 2007 4.275800e+04 -0.898620
Indonesia 2013 2.523226e+07 -0.898618
Indonesia 2005 2.267127e+07 -0.898615
Chile 2012 1.739746e+06 -0.898577
Indonesia 2001 2.145652e+06 -0.898572
Morocco 2014 3.431882e+06 -0.898539
India 2008 1.197147e+08 -0.898519
Turkey 2002 6.514354e+06 -0.898517
Costa Rica 2004 4.187380e+05 -0.898512
Nicaragua 2003 5.248790e+05 -0.898510
Australia 2014 2.346694e+06 -0.898488
Dominican Republic 2003 8.967760e+05 -0.898466
Zimbabwe 2007 1.332999e+06 -0.898432
El Salvador 2001 5.959620e+05 -0.898432
Haiti 2005 9.263440e+05 -0.898418
Morocco 2010 3.249639e+06 -0.898417
Nepal 2003 2.495623e+06 -0.898413
Azerbaijan 2004 8.365000e+03 -0.898410
Kazakhstan 2006 1.538840e+05 -0.898408
Turkey 2011 7.349455e+06 -0.898386
South Africa 2012 5.256516e+06 -0.898384
Mexico 2008 1.136619e+07 -0.898368
Philippines 2011 9.527794e+06 -0.898345
Ecuador 2010 1.493469e+06 -0.898343
Armenia 2008 2.982200e+04 -0.898342
Mongolia 2010 2.712650e+05 -0.898337
Djibouti 2006 7.962800e+04 -0.898337
Philippines 2013 9.848132e+06 -0.898333
Paraguay 2003 5.679500e+04 -0.898328
Turkmenistan 2004 4.733980e+05 -0.898320
Djibouti 2009 8.368400e+04 -0.898310
Ecuador 2006 1.396748e+06 -0.898309
Ecuador 2014 1.593112e+06 -0.898279
South Africa 2015 5.511977e+06 -0.898203
Luxembourg 2008 4.886500e+04 -0.898196
Malaysia 2008 2.711169e+06 -0.898175
Dominican Republic 2007 9.543530e+05 -0.898163
Bangladesh 2003 1.391910e+05 -0.898153
Spain 2007 4.522683e+06 -0.898132
Thailand 2002 6.473164e+06 -0.898130
Philippines 2006 8.789419e+06 -0.898122
Malaysia 2012 2.917456e+06 -0.898116
Zimbabwe 2009 1.381599e+06 -0.898101
Cabo Verde 2002 4.521600e+04 -0.898097
Eritrea 2009 4.313340e+05 -0.898093
Italy 2009 5.995365e+06 -0.898084
Israel 2012 7.915000e+03 -0.898079
Swaziland 2014 1.295970e+05 -0.898072
Mongolia 2013 2.869170e+05 -0.898048
Lesotho 2002 1.923120e+05 -0.898029
Eritrea 2010 4.398400e+04 -0.898028
Panama 2005 3.334650e+05 -0.898009
Israel 2015 8.381000e+03 -0.897988
Malaysia 2003 2.468873e+06 -0.897975
Israel 2002 6.570000e+02 -0.897966
Jamaica 2009 2.848200e+04 -0.897959
Central African Republic 2001 3.832230e+05 -0.897943
Pakistan 2015 1.893851e+07 -0.897931
Cyprus 2001 9.628200e+04 -0.897929
Kiribati 2006 9.426000e+03 -0.897904
India 2010 1.239869e+07 -0.897892
Guatemala 2003 1.254780e+05 -0.897890
Sudan 2009 3.365619e+06 -0.897874
Solomon Islands 2014 5.755400e+04 -0.897866
Guinea-Bissau 2003 1.321220e+05 -0.897859
Suriname 2009 5.261900e+04 -0.897857
Guatemala 2012 1.527156e+06 -0.897842
Kiribati 2008 9.844000e+03 -0.897789
Timor-Leste 2012 1.156760e+05 -0.897770
Pakistan 2001 1.416144e+07 -0.897769
Cambodia 2001 1.242473e+06 -0.897759
Costa Rica 2011 4.647400e+04 -0.897753
Vanuatu 2014 2.588500e+04 -0.897745
Tajikistan 2010 7.641630e+05 -0.897741
Belize 2012 3.367100e+04 -0.897716
Tajikistan 2012 7.995620e+05 -0.897701
Zimbabwe 2012 1.471826e+06 -0.897695
Namibia 2013 2.316520e+05 -0.897677
Papua New Guinea 2012 7.438360e+05 -0.897675
Sierra Leone 2012 6.766130e+05 -0.897664
Sudan 2012 3.599192e+06 -0.897655
Pakistan 2009 1.674958e+06 -0.897647
Armenia 2014 2.962200e+04 -0.897629
Sao Tome and Principe 2007 1.631100e+04 -0.897626
Solomon Islands 2007 4.929400e+04 -0.897608
Luxembourg 2015 5.696400e+04 -0.897606
Swaziland 2009 1.186750e+05 -0.897597
Sao Tome and Principe 2005 1.556300e+04 -0.897591
Sudan 2015 3.864783e+06 -0.897589
Comoros 2007 6.416200e+04 -0.897574
Ghana 2012 2.573349e+06 -0.897565
Vanuatu 2008 2.253400e+04 -0.897551
Belize 2010 3.216800e+04 -0.897531
Papua New Guinea 2005 6.314790e+05 -0.897512
Djibouti 2004 7.775200e+04 -0.897508
Sao Tome and Principe 2013 1.874500e+04 -0.897506
Iraq 2007 2.839433e+06 -0.897486
Sudan 2001 2.794550e+05 -0.897468
Guinea-Bissau 2010 1.555880e+05 -0.897467
Costa Rica 2015 4.878520e+05 -0.897458
Afghanistan 2008 2.729431e+06 -0.897455
Ghana 2001 1.942165e+06 -0.897450
Nigeria 2002 1.286667e+07 -0.897447
Gabon 2002 1.294490e+05 -0.897447
Ethiopia 2015 9.987333e+06 -0.897426
Syrian Arab Republic 2004 1.786638e+06 -0.897410
Liberia 2014 4.397370e+05 -0.897409
Honduras 2001 6.693610e+05 -0.897405
Togo 2015 7.416820e+05 -0.897401
Germany 2013 8.645650e+05 -0.897391
Turkey 2003 6.685830e+05 -0.897368
Panama 2014 3.939860e+05 -0.897358
Nigeria 2006 1.426149e+07 -0.897355
Trinidad and Tobago 2006 1.331440e+05 -0.897339
Togo 2003 5.391410e+05 -0.897335
Liberia 2005 3.261230e+05 -0.897330
Iraq 2004 2.631669e+06 -0.897311
Cameroon 2014 2.223994e+06 -0.897302
Nigeria 2014 1.764652e+06 -0.897302
Cameroon 2004 1.695981e+06 -0.897299
Honduras 2014 8.892160e+05 -0.897293
Nigeria 2013 1.718293e+07 -0.897291
Greece 2012 1.145110e+05 -0.897290
Nigeria 2011 1.628778e+07 -0.897289
Gabon 2004 1.364250e+05 -0.897282
Madagascar 2014 2.358981e+06 -0.897262
Malawi 2004 1.267638e+06 -0.897246
Paraguay 2010 6.298770e+05 -0.897211
Cameroon 2010 1.997495e+06 -0.897209
Togo 2009 6.334720e+05 -0.897194
Cabo Verde 2006 4.879500e+04 -0.897180
Senegal 2013 1.412320e+05 -0.897175
Nicaragua 2004 5.397300e+04 -0.897171
Seychelles 2010 8.977000e+03 -0.897168
Benin 2009 8.944760e+05 -0.897150
Belize 2002 2.622600e+04 -0.897146
Guinea 2003 9.398480e+05 -0.897142
Ethiopia 2004 7.462445e+06 -0.897134
Guinea-Bissau 2015 1.775260e+05 -0.897131
Guinea-Bissau 2008 1.488410e+05 -0.897064
Mozambique 2011 2.493950e+05 -0.897035
Mozambique 2009 2.352463e+06 -0.897033
Madagascar 2006 1.888268e+06 -0.897023
Mali 2015 1.746795e+06 -0.897022
Gabon 2015 1.931750e+05 -0.897012
Burkina Faso 2015 1.811624e+06 -0.896985
Cambodia 2013 1.522692e+06 -0.896954
Timor-Leste 2015 1.249770e+05 -0.896953
Guatemala 2002 1.228848e+06 -0.896951
Burkina Faso 2007 1.425221e+06 -0.896941
Burkina Faso 2009 1.514199e+06 -0.896921
Senegal 2008 1.223957e+06 -0.896917
Zambia 2013 1.515321e+06 -0.896916
Mauritania 2012 3.832390e+05 -0.896914
Malawi 2009 1.471462e+06 -0.896893
Bhutan 2004 6.428200e+04 -0.896890
China 2005 1.337200e+04 -0.896881
Malawi 2007 1.384969e+06 -0.896869
Argentina 2012 4.296739e+06 -0.896854
Benin 2003 7.525550e+05 -0.896845
Madagascar 2004 1.782997e+06 -0.896812
Burundi 2013 9.618600e+04 -0.896793
Nicaragua 2014 6.139970e+05 -0.896733
Iceland 2015 3.381500e+04 -0.896712
Costa Rica 2006 4.387940e+05 -0.896702
Kazakhstan 2013 1.735275e+06 -0.896657
Mali 2006 1.322764e+06 -0.896649
Sweden 2005 9.295720e+05 -0.896640
Burundi 2009 8.489310e+05 -0.896626
Papua New Guinea 2010 7.182390e+05 -0.896618
Zambia 2015 1.615870e+05 -0.896616
Tajikistan 2008 7.397280e+05 -0.896576
Uganda 2002 2.571848e+06 -0.896525
Malta 2006 4.538000e+03 -0.896473
Malta 2004 4.126800e+04 -0.896463
Uganda 2005 2.854394e+06 -0.896462
Chad 2009 1.152786e+06 -0.896461
Botswana 2003 1.843390e+05 -0.896436
Panama 2011 3.777820e+05 -0.896306
Hungary 2004 1.171460e+05 -0.896290
Mali 2009 1.466597e+06 -0.896267
Angola 2003 1.823369e+06 -0.896238
Chad 2012 1.275135e+06 -0.896235
Malawi 2002 1.213711e+06 -0.896227
Niger 2006 1.413264e+06 -0.896224
Syrian Arab Republic 2007 1.963286e+06 -0.896205
Bangladesh 2009 1.545478e+06 -0.896178
Chad 2001 8.663120e+05 -0.896158
South Sudan 2003 7.516420e+05 -0.896143
Chad 2004 9.714300e+04 -0.896139
Germany 2014 8.982500e+04 -0.896104
Luxembourg 2012 5.394600e+04 -0.895927
Afghanistan 2009 2.843310e+05 -0.895828
Mauritius 2002 1.246210e+05 -0.895827
Equatorial Guinea 2002 6.664700e+04 -0.895825
Maldives 2014 4.100000e+01 -0.895674
Swaziland 2004 1.955300e+04 -0.895657
Samoa 2013 1.975700e+04 -0.895573
South Sudan 2009 9.676670e+05 -0.895536
Panama 2003 3.291740e+05 -0.895476
Trinidad and Tobago 2007 1.392600e+04 -0.895406
Italy 2013 6.233948e+06 -0.895298
Sri Lanka 2003 1.983000e+03 -0.895295
Nigeria 2008 1.534739e+06 -0.895180
Timor-Leste 2003 9.685200e+04 -0.895162
Mauritania 2008 3.475410e+05 -0.895087
Belize 2006 2.974700e+04 -0.894990
South Sudan 2005 8.188770e+05 -0.894849
Sierra Leone 2007 6.154170e+05 -0.894777
Jordan 2010 7.182390e+05 -0.894704
Cabo Verde 2010 5.238400e+04 -0.894592
Lebanon 2003 3.714640e+05 -0.894555
Zimbabwe 2013 1.554560e+05 -0.894379
Botswana 2015 2.291970e+05 -0.894310
Iraq 2005 2.784260e+05 -0.894202
Chad 2002 9.168900e+04 -0.894162
Uganda 2011 3.593648e+06 -0.894040
South Africa 2009 5.255813e+06 -0.893945
Cambodia 2004 1.363377e+06 -0.893926
Sri Lanka 2010 2.119000e+03 -0.893880
Peru 2012 3.158966e+06 -0.893852
Spain 2001 4.854120e+05 -0.893733
Morocco 2004 3.179285e+06 -0.893470
Azerbaijan 2010 9.543320e+05 -0.893338
Malawi 2014 1.768838e+06 -0.893297
Sao Tome and Principe 2009 1.781300e+04 -0.893280
Zambia 2005 1.252156e+06 -0.893268
Honduras 2003 7.338210e+05 -0.893078
Australia 2004 2.127400e+04 -0.893071
Burkina Faso 2013 1.772723e+06 -0.893024
Lebanon 2012 4.916440e+05 -0.892850
Brazil 2015 2.596218e+06 -0.892776
Kiribati 2004 9.542000e+03 -0.892660
Niger 2011 1.764636e+06 -0.892568
Mali 2010 1.575850e+05 -0.892551
Burkina Faso 2011 1.681940e+05 -0.892543
Cambodia 2009 1.492800e+04 -0.892495
Cameroon 2002 1.684886e+06 -0.892490
Afghanistan 2003 2.364851e+06 -0.892409
Burundi 2011 9.435800e+04 -0.892371
Myanmar 2011 5.553310e+05 -0.892292
Timor-Leste 2008 1.781100e+04 -0.892037
Sudan 2004 3.186341e+06 -0.891753
Togo 2013 7.429480e+05 -0.891690
Malaysia 2014 3.228170e+05 -0.891553
Paraguay 2008 6.471170e+05 -0.891535
Botswana 2010 2.148660e+05 -0.891475
Tunisia 2007 1.298870e+05 -0.891411
Zambia 2008 1.382517e+06 -0.891363
Haiti 2015 1.711610e+05 -0.891151
Italy 2014 6.789140e+05 -0.891094
Uzbekistan 2013 3.243200e+04 -0.891075
Bulgaria 2001 8.914200e+04 -0.890914
Ethiopia 2002 7.497192e+06 -0.890540
Cyprus 2009 1.987600e+04 -0.890528
Argentina 2009 4.799470e+05 -0.890483
Lesotho 2009 2.192900e+04 -0.890351
India 2002 1.898711e+07 -0.889274
Comoros 2011 7.656900e+04 -0.888981
Nicaragua 2015 6.823500e+04 -0.888868
Switzerland 2013 8.893460e+05 -0.888788
Cabo Verde 2011 5.867000e+03 -0.888000
Djibouti 2007 8.942000e+03 -0.887703
Benin 2001 7.767330e+05 -0.886872
Armenia 2004 3.612000e+03 -0.886365
Dominican Republic 2014 1.458440e+05 -0.886175
Iceland 2006 3.378200e+04 -0.886154
Cyprus 2006 1.455900e+04 -0.885953
Equatorial Guinea 2009 9.911100e+04 -0.885872
Luxembourg 2010 5.695300e+04 -0.885587
Maldives 2003 3.400000e+01 -0.885522
Guinea 2010 1.794170e+05 -0.884732
Angola 2006 2.262399e+06 -0.884291
Mauritania 2004 3.428230e+05 -0.884069
Georgia 2007 4.820000e+02 -0.883462
Mauritania 2014 4.639200e+04 -0.882438
Burundi 2015 1.199270e+05 -0.878761
Liberia 2002 3.628630e+05 -0.878687
Vanuatu 2004 2.414300e+04 -0.878656
Uganda 2007 3.594870e+05 -0.878373
Rwanda 2011 1.516710e+05 -0.878356
Vanuatu 2005 2.937000e+03 -0.878350
Ghana 2003 2.446782e+06 -0.877197
Chad 2007 1.775780e+05 -0.875086
Rwanda 2010 1.246842e+06 -0.875034
Lithuania 2010 3.972820e+05 -0.874394
Cameroon 2011 2.524470e+05 -0.873618
Mozambique 2005 2.923700e+04 -0.873583
Timor-Leste 2005 1.264840e+05 -0.873097
Madagascar 2009 2.569121e+06 -0.871521
Latvia 2010 2.975550e+05 -0.861064
Kiribati 2009 1.568000e+03 -0.840715
Hungary 2005 1.876500e+04 -0.839815
South Sudan 2010 1.671920e+05 -0.827222
Tunisia 2013 1.114558e+06 -0.409245
Belgium 2011 1.147744e+06 -0.394518
Equatorial Guinea 2012 1.385930e+05 0.393889
South Sudan 2013 1.117749e+06 -0.385264
Kiribati 2014 1.145800e+04 -0.381818
India 2003 1.182785e+07 -0.377059
Guinea 2011 1.135170e+05 -0.367301
Romania 2007 2.882982e+06 0.360298
Syrian Arab Republic 2011 2.863993e+06 0.351684
Rwanda 2013 1.165151e+06 -0.348660
Equatorial Guinea 2013 1.837460e+05 0.325796
Angola 2007 2.997687e+06 0.325004
Angola 2008 2.175942e+06 -0.274126
Botswana 2013 2.128570e+05 -0.264273
South Sudan 2012 1.818258e+06 0.254960
Syrian Arab Republic 2010 2.118834e+06 -0.249942
Afghanistan 2012 3.696958e+06 0.241173
Senegal 2002 1.396861e+06 0.231260
Sudan 2005 3.911914e+06 0.227714
Namibia 2008 2.163750e+05 -0.226997
Benin 2015 1.575952e+06 0.224790
Ghana 2004 2.986536e+06 0.220598
Syrian Arab Republic 2009 2.824893e+06 0.214776
Belize 2008 3.616500e+04 0.211680
Zambia 2001 1.824125e+06 0.191288
Syrian Arab Republic 2008 2.325443e+06 0.184465
Madagascar 2010 2.115164e+06 -0.176697
Guinea 2009 1.556524e+06 0.176385
Timor-Leste 2006 1.486210e+05 0.175018
Lithuania 2011 3.281150e+05 -0.174101
Botswana 2011 2.513390e+05 0.169748
Uzbekistan 2014 3.757700e+04 0.158640
Kiribati 2011 1.465600e+04 0.158394
Canada 2001 3.181900e+04 -0.155928
Lebanon 2008 4.111470e+05 -0.154829
Botswana 2012 2.893150e+05 0.151095
Senegal 2001 1.134497e+06 0.147751
South Africa 2010 5.979432e+06 0.137680
Tunisia 2008 1.473360e+05 0.134340
Kiribati 2012 1.661300e+04 0.133529
Peru 2013 3.565716e+06 0.128760
Australia 2006 2.697900e+04 0.126566
Australia 2005 2.394800e+04 0.125693
Sierra Leone 2014 7.791620e+05 0.125503
Haiti 2012 1.289210e+05 0.125417
Colombia 2001 4.988990e+05 0.123753
Romania 2008 2.537875e+06 -0.119705
Lesotho 2010 2.455100e+04 0.119568
Norway 2013 5.796230e+05 0.117727
Kiribati 2013 1.853500e+04 0.115693
Serbia 2015 7.953830e+05 0.114644
Peru 2014 3.973354e+06 0.114321
Timor-Leste 2007 1.649730e+05 0.110025
Cyprus 2008 1.815630e+05 0.109039
Solomon Islands 2008 5.447700e+04 0.105145
Namibia 2006 2.557340e+05 0.101371
Suriname 2007 5.975000e+03 0.098952
Haiti 2014 1.572466e+06 0.098263
Lesotho 2012 2.899280e+05 0.097522
Argentina 2008 4.382389e+06 0.096358
Namibia 2007 2.799150e+05 0.094555
Latvia 2013 2.126470e+05 -0.092489
Philippines 2008 9.751864e+06 0.092114
Hungary 2007 1.557800e+04 -0.090973
Mexico 2003 1.564453e+06 0.089780
Israel 2006 7.537000e+03 0.087433
Mexico 2005 1.847223e+07 0.086884
Hungary 2006 1.713700e+04 -0.086757
Israel 2013 8.595000e+03 0.085913
Slovenia 2009 2.396690e+05 0.082927
Ethiopia 2011 9.467560e+05 0.079211
Timor-Leste 2009 1.922100e+04 0.079165
Tunisia 2010 1.639931e+06 0.077602
Zimbabwe 2010 1.486317e+06 0.075795
Tunisia 2011 1.761467e+06 0.074110
Lebanon 2013 5.276120e+05 0.073159
Sri Lanka 2014 2.771000e+03 0.071954
Sri Lanka 2011 2.271000e+03 0.071732
Sierra Leone 2015 7.237250e+05 -0.071150
Tunisia 2012 1.886668e+06 0.071078
Swaziland 2001 1.729270e+05 0.070968
Albania 2001 3.617300e+04 -0.070748
Sri Lanka 2015 2.966000e+03 0.070372
Sri Lanka 2012 2.425000e+03 0.067812
Lebanon 2014 5.632790e+05 0.067601
Romania 2009 2.367487e+06 -0.067138
Guatemala 2008 1.463660e+05 0.066139
Sri Lanka 2013 2.585000e+03 0.065979
Malta 2005 4.383400e+04 0.062179
Lebanon 2011 4.588368e+06 0.057925
Tonga 2007 1.235700e+04 0.057148
Malawi 2005 1.339711e+06 0.056856
Malta 2008 4.937900e+04 0.056823
Portugal 2001 1.362722e+06 0.056457
Vanuatu 2009 2.378500e+04 0.055516
Belgium 2010 1.895586e+06 0.055159
Jordan 2012 7.992573e+06 0.055133
Philippines 2009 9.222879e+06 -0.054245
Jordan 2013 8.413464e+06 0.052660
Solomon Islands 2009 5.167900e+04 -0.051361
Jordan 2009 6.821116e+06 0.051048
Romania 2010 2.246871e+06 -0.050947
Gabon 2005 1.431260e+05 0.049119
Sierra Leone 2003 5.199549e+06 0.048885
Lebanon 2002 3.522837e+06 0.048507
Pakistan 2010 1.756182e+06 0.048493
Australia 2007 2.827600e+04 0.048074
Maldives 2007 3.490000e+02 0.048048
Jordan 2008 6.489822e+06 0.047896
Albania 2005 3.114870e+05 -0.047263
Israel 2007 7.181000e+03 -0.047234
Kenya 2004 3.574931e+06 0.047184
Equatorial Guinea 2008 8.684180e+05 0.047136
Austria 2002 8.819570e+05 0.047091
Belgium 2006 1.547958e+06 0.046896
Equatorial Guinea 2007 8.293270e+05 0.046843
Norway 2012 5.185730e+05 0.046802
Sierra Leone 2004 5.439695e+06 0.046186
Equatorial Guinea 2006 7.922170e+05 0.046084
Sierra Leone 2002 4.957216e+06 0.046014
Equatorial Guinea 2011 9.942900e+04 0.045367
Equatorial Guinea 2005 7.573170e+05 0.044839
Romania 2011 2.147528e+06 -0.044214
El Salvador 2006 6.564780e+05 0.043750
Jordan 2007 6.193191e+06 0.043638
Equatorial Guinea 2004 7.248170e+05 0.043486
Liberia 2008 3.662993e+06 0.042717
Portugal 2002 1.419631e+06 0.041761
Equatorial Guinea 2001 6.397620e+05 0.041410
El Salvador 2007 6.834750e+05 0.041124
Equatorial Guinea 2015 1.175389e+06 0.040698
Liberia 2007 3.512932e+06 0.040610
Liberia 2009 3.811528e+06 0.040550
Guatemala 2006 1.339780e+05 -0.040465
Belgium 2005 1.478617e+06 0.040446
Equatorial Guinea 2010 9.511400e+04 -0.040329
Sierra Leone 2005 5.658379e+06 0.040202
Niger 2013 1.842637e+07 0.039181
Niger 2014 1.914822e+07 0.039175
Niger 2015 1.989696e+07 0.039103
Azerbaijan 2011 9.173820e+05 -0.038718
Niger 2010 1.642558e+07 0.038679
Jordan 2006 5.934232e+06 0.038522
Lebanon 2001 3.359859e+06 0.038479
Niger 2009 1.581391e+07 0.038440
Sierra Leone 2001 4.739147e+06 0.038308
Niger 2008 1.522852e+07 0.038190
Portugal 2013 1.457295e+06 -0.037990
Austria 2001 8.422930e+05 0.037861
South Sudan 2002 7.237276e+06 0.037685
Central African Republic 2002 3.976120e+05 0.037547
Maldives 2006 3.330000e+02 0.037383
Albania 2004 3.269390e+05 -0.037327
Maldives 2008 3.620000e+02 0.037249
Mauritania 2010 3.695430e+05 0.037203
Slovenia 2010 2.485830e+05 0.037193
Niger 2004 1.312712e+06 0.037154
Greece 2014 1.892413e+06 -0.037043
Liberia 2001 2.991132e+06 0.036959
Niger 2001 1.177198e+07 0.036907
Tonga 2015 1.636400e+04 0.036877
Lebanon 2010 4.337141e+06 0.036811
Angola 2005 1.955254e+07 0.036406
Angola 2011 2.421856e+07 0.036349
Angola 2010 2.336913e+07 0.036346
Montenegro 2002 6.982800e+04 0.036193
Liberia 2010 3.948125e+06 0.035838
Angola 2014 2.692466e+06 0.035630
Uganda 2006 2.955662e+06 0.035478
Rwanda 2008 9.781690e+05 0.035382
Uganda 2003 2.662482e+06 0.035241
Swaziland 2010 1.228430e+05 0.035121
Timor-Leste 2002 9.238250e+05 0.035062
Uganda 2009 3.277190e+07 0.034993
Uganda 2010 3.391513e+07 0.034885
Angola 2015 2.785935e+06 0.034715
Angola 2002 1.757265e+07 0.034704
Burundi 2007 7.939573e+06 0.034426
Gabon 2013 1.817271e+06 0.034411
Burundi 2008 8.212264e+06 0.034346
Eritrea 2003 3.738265e+06 0.034201
Pakistan 2007 1.633297e+07 0.034126
Uganda 2014 3.883334e+07 0.034074
Burundi 2006 7.675338e+06 0.033954
Syrian Arab Republic 2006 1.891498e+07 0.033910
Mali 2007 1.367566e+06 0.033870
Iraq 2013 3.388314e+07 0.033761
Burkina Faso 2010 1.565217e+06 0.033693
Sierra Leone 2006 5.848692e+06 0.033634
Eritrea 2002 3.614639e+06 0.033603
Burundi 2005 7.423289e+06 0.033531
Chad 2014 1.356944e+07 0.033186
Afghanistan 2011 2.978599e+06 0.033100
Burundi 2004 7.182451e+06 0.032983
Gabon 2009 1.586754e+06 0.032767
Belgium 2004 1.421137e+06 0.032703
Burundi 2010 8.766930e+05 0.032702
Belgium 2003 1.376133e+06 0.032524
South Sudan 2014 1.153971e+06 0.032406
Jordan 2005 5.714111e+06 0.032249
Eritrea 2004 3.858623e+06 0.032196
Gabon 2014 1.875713e+06 0.032159
Lebanon 2005 3.986852e+06 0.031990
Madagascar 2001 1.626932e+06 0.031868
Gabon 2008 1.536411e+06 0.031707
Honduras 2007 7.779720e+05 0.031593
Zambia 2014 1.562974e+06 0.031447
Burundi 2003 6.953113e+06 0.031379
Chad 2010 1.188722e+06 0.031173
Tonga 2011 1.457700e+04 0.031124
Belize 2001 2.549840e+05 0.031009
Malawi 2010 1.516795e+06 0.030808
Togo 2010 6.529520e+05 0.030751
Gabon 2007 1.489193e+06 0.030695
Madagascar 2003 1.727914e+07 0.030660
Mali 2002 1.163893e+07 0.030609
Zambia 2012 1.469994e+07 0.030507
Mozambique 2003 1.971660e+07 0.030144
Mozambique 2007 2.218839e+07 0.029745
Mauritania 2013 3.946170e+05 0.029689
Mozambique 2008 2.284676e+07 0.029672
Mozambique 2002 1.913966e+07 0.029636
Senegal 2015 1.497699e+07 0.029622
Mozambique 2010 2.422145e+06 0.029621
Tonga 2014 1.578200e+04 0.029619
Mali 2014 1.696285e+07 0.029435
Mozambique 2014 2.721238e+07 0.029432
Benin 2006 8.216896e+06 0.029399
Ethiopia 2001 6.849226e+07 0.029381
Ireland 2007 4.398942e+06 0.029332
Mauritania 2003 2.957117e+06 0.029197
Iraq 2001 2.425165e+07 0.029120
Swaziland 2006 1.125140e+05 -0.028989
Madagascar 2008 1.999647e+07 0.028968
Benin 2007 8.454791e+06 0.028952
Australia 2010 2.231750e+05 0.028850
Maldives 2005 3.210000e+02 0.028846
Belize 2004 2.768900e+04 0.028834
Israel 2008 7.388000e+03 0.028826
Benin 2008 8.696916e+06 0.028638
Uzbekistan 2010 2.856240e+05 0.028631
Iraq 2002 2.493930e+07 0.028355
Burundi 2002 6.741569e+06 0.028332
Belgium 2001 1.286570e+05 0.028228
Ethiopia 2005 7.672783e+06 0.028186
Malawi 2001 1.169586e+07 0.028102
Togo 2001 5.111770e+05 0.027766
Ethiopia 2006 7.885689e+06 0.027748
Madagascar 2012 2.234657e+07 0.027715
Cameroon 2007 1.839539e+07 0.027701
Zambia 2007 1.272597e+07 0.027660
Portugal 2003 1.458821e+06 0.027606
Cameroon 2005 1.742795e+06 0.027603
Iraq 2003 2.562763e+07 0.027600
Madagascar 2013 2.296115e+07 0.027502
Ghana 2009 2.393831e+06 0.027455
Togo 2008 6.161796e+06 0.027414
Senegal 2007 1.187356e+07 0.027412
Namibia 2014 2.379920e+05 0.027369
Madagascar 2015 2.423488e+06 0.027345
Togo 2007 5.997385e+06 0.027338
Ireland 2006 4.273591e+06 0.027327
Kenya 2012 4.364663e+07 0.027298
Maldives 2011 3.770000e+02 0.027248
Uzbekistan 2011 2.933940e+05 0.027204
Togo 2006 5.837792e+06 0.027189
Senegal 2006 1.155676e+07 0.027152
Zambia 2004 1.173175e+07 0.027120
Kenya 2013 4.482685e+07 0.027040
Togo 2012 6.859482e+06 0.026979
Solomon Islands 2002 4.352620e+05 0.026917
Iraq 2009 2.989465e+07 0.026905
Togo 2005 5.683268e+06 0.026862
Ethiopia 2009 8.541625e+07 0.026824
Norway 2007 4.791530e+05 0.026734
Samoa 2014 1.922900e+04 -0.026725
Nigeria 2007 1.464172e+07 0.026661
Guinea-Bissau 2013 1.681495e+06 0.026467
Solomon Islands 2003 4.467690e+05 0.026437
Ethiopia 2013 9.488772e+07 0.026433
Sudan 2003 2.943594e+07 0.026373
Mauritius 2003 1.213370e+05 -0.026352
Guinea-Bissau 2014 1.725744e+06 0.026315
Guinea-Bissau 2012 1.638139e+06 0.026304
Belize 2003 2.691300e+04 0.026195
Nigeria 2005 1.389395e+08 0.026189
Ethiopia 2014 9.736677e+07 0.026126
Senegal 2009 1.255917e+06 0.026112
Cameroon 2001 1.567193e+07 0.026037
Nigeria 2004 1.353936e+08 0.025923
Solomon Islands 2004 4.583240e+05 0.025863
Papua New Guinea 2001 5.716152e+06 0.025830
Vanuatu 2003 1.989640e+05 0.025820
Guinea-Bissau 2005 1.388380e+05 0.025808
Jordan 2004 5.535595e+06 0.025723
Tonga 2012 1.495100e+04 0.025657
Pakistan 2006 1.579399e+07 0.025606
Greece 2008 1.177841e+06 0.025571
Papua New Guinea 2002 5.862316e+06 0.025570
Gabon 2001 1.262259e+06 0.025292
Solomon Islands 2005 4.698850e+05 0.025225
Tonga 2013 1.532800e+04 0.025216
Mauritania 2009 3.562880e+05 0.025168
Iceland 2012 3.271600e+04 0.025130
Rwanda 2015 1.162955e+07 0.025050
Liberia 2013 4.286291e+06 0.025045
Papua New Guinea 2006 6.472720e+05 0.025010
Haiti 2006 9.494570e+05 0.024951
Comoros 2001 5.558880e+05 0.024949
Rwanda 2002 8.536250e+05 0.024826
Vanuatu 2007 2.199530e+05 0.024782
Guatemala 2007 1.372860e+05 0.024691
Solomon Islands 2006 4.814220e+05 0.024553
Comoros 2002 5.694790e+05 0.024449
Comoros 2010 6.896920e+05 0.024419
Guinea-Bissau 2004 1.353450e+05 0.024394
Comoros 2009 6.732520e+05 0.024380
Sao Tome and Principe 2004 1.519690e+05 0.024243
Comoros 2006 6.264250e+05 0.024194
Comoros 2003 5.832110e+05 0.024113
Comoros 2005 6.116270e+05 0.024110
Sudan 2014 3.773791e+07 0.024098
Sao Tome and Principe 2003 1.483720e+05 0.024039
Comoros 2004 5.972280e+05 0.024034
Papua New Guinea 2008 6.787187e+06 0.024029
Timor-Leste 2014 1.212814e+06 0.024020
Ghana 2010 2.451214e+06 0.023971
Israel 2001 6.439000e+03 0.023851
Comoros 2015 7.774240e+05 0.023755
Bhutan 2006 6.722280e+05 0.023741
Zimbabwe 2015 1.577745e+07 0.023734
Papua New Guinea 2009 6.947447e+06 0.023612
Vanuatu 2011 2.418710e+05 0.023598
Guinea-Bissau 2007 1.445958e+06 0.023565
Ghana 2014 2.696256e+07 0.023393
Vanuatu 2012 2.474850e+05 0.023211
Sierra Leone 2013 6.922790e+05 0.023154
Ecuador 2004 1.359647e+06 0.023090
Sao Tome and Principe 2002 1.448890e+05 0.023068
Ghana 2015 2.758282e+07 0.023004
Vanuatu 2013 2.531420e+05 0.022858
Tajikistan 2013 8.177890e+05 0.022796
Honduras 2006 7.541460e+05 0.022789
Sudan 2011 3.516731e+07 0.022723
Guinea 2013 1.153662e+07 0.022616
Seychelles 2015 9.341900e+04 0.022548
Sao Tome and Principe 2015 1.955530e+05 0.022414
Vanuatu 2015 2.646300e+04 0.022330
Tajikistan 2015 8.548651e+06 0.022230
Guatemala 2010 1.463417e+06 0.022205
Malaysia 2001 2.369897e+06 0.022138
Solomon Islands 2012 5.515310e+05 0.022084
Paraguay 2002 5.586110e+05 0.021929
Bhutan 2007 6.869580e+05 0.021912
Belize 2014 3.516940e+05 0.021829
Namibia 2012 2.263934e+06 0.021806
Solomon Islands 2013 5.635130e+05 0.021725
Central African Republic 2006 4.217580e+05 0.021723
Belize 2015 3.592880e+05 0.021593
Philippines 2001 7.966532e+07 0.021461
Papua New Guinea 2014 7.755785e+06 0.021457
Pakistan 2012 1.779115e+08 0.021398
Pakistan 2013 1.817126e+08 0.021365
Solomon Islands 2010 5.277900e+04 0.021285
Azerbaijan 2008 8.763400e+04 0.021221
Maldives 2012 3.850000e+02 0.021220
Papua New Guinea 2015 7.919825e+06 0.021151
Brazil 2002 1.815121e+06 0.021127
Pakistan 2014 1.855463e+08 0.021097
Djibouti 2001 7.327110e+05 0.021080
Tajikistan 2005 6.854176e+06 0.021054
Guatemala 2014 1.592356e+07 0.020989
Maldives 2001 2.920000e+02 0.020979
Sudan 2008 3.295550e+07 0.020846
Guinea 2006 9.881428e+06 0.020836
Australia 2009 2.169170e+05 0.020824
Maldives 2013 3.930000e+02 0.020779
Tajikistan 2004 6.712841e+06 0.020673
Jordan 2003 5.396774e+06 0.020669
Guatemala 2015 1.625243e+07 0.020653
Ireland 2008 4.489544e+06 0.020596
Cabo Verde 2003 4.614700e+04 0.020590
Algeria 2013 3.833856e+07 0.020570
Ecuador 2007 1.425453e+06 0.020551
Guinea-Bissau 2002 1.293523e+06 0.020521
Kenya 2005 3.648288e+06 0.020520
Malaysia 2013 2.976724e+06 0.020315
Algeria 2012 3.756585e+07 0.020269
Algeria 2014 3.911331e+07 0.020208
Tajikistan 2003 6.576877e+06 0.020036
Malaysia 2009 2.765383e+06 0.019997
Vanuatu 2001 1.892900e+04 0.019717
Malaysia 2004 2.517419e+06 0.019663
Rwanda 2005 8.991735e+06 0.019652
Guinea-Bissau 2001 1.267512e+06 0.019532
China 2006 1.311200e+04 -0.019444
Algeria 2011 3.681956e+07 0.019434
Algeria 2015 3.987153e+07 0.019385
Liberia 2004 3.176414e+06 0.019312
Eritrea 2008 4.232636e+06 0.019094
Mongolia 2012 2.814226e+06 0.019087
Tajikistan 2002 6.447688e+06 0.019055
Malaysia 2006 2.614357e+07 0.018869
Philippines 2005 8.627424e+07 0.018845
Iceland 2008 3.174140e+05 0.018770
Swaziland 2012 1.248158e+06 0.018690
Turkmenistan 2013 5.366277e+06 0.018687
Swaziland 2013 1.271456e+06 0.018666
Israel 2011 7.765800e+04 0.018653
Turkmenistan 2014 5.466241e+06 0.018628
Malaysia 2011 2.863513e+07 0.018598
Bangladesh 2002 1.366667e+06 0.018596
Guinea 2002 9.137345e+06 0.018527
Bhutan 2010 7.276410e+05 0.018452
Malaysia 2007 2.662584e+07 0.018447
Djibouti 2002 7.462210e+05 0.018438
Israel 2010 7.623600e+04 0.018435
Kiribati 2015 1.124700e+04 -0.018415
Nepal 2010 2.723137e+06 0.018333
Panama 2007 3.453870e+05 0.018255
Spain 2003 4.218764e+07 0.018249
Germany 2012 8.425823e+06 0.018228
Honduras 2005 7.373430e+05 0.018214
Costa Rica 2001 3.996798e+06 0.018178
Mongolia 2015 2.976877e+06 0.018120
Turkmenistan 2015 5.565284e+06 0.018119
Jordan 2002 5.287488e+06 0.018101
Dominican Republic 2004 9.129980e+05 0.018089
Namibia 2001 1.933596e+06 0.018080
Swaziland 2008 1.158897e+06 0.017975
Panama 2009 3.579385e+06 0.017950
Romania 2002 2.173496e+06 -0.017938
Swaziland 2015 1.319110e+05 0.017855
Panama 2010 3.643222e+06 0.017835
Kiribati 2003 8.889500e+04 0.017769
Ecuador 2001 1.285276e+07 0.017750
Djibouti 2012 8.811850e+05 0.017609
Djibouti 2013 8.966880e+05 0.017593
Nepal 2004 2.539449e+06 0.017561
Spain 2004 4.292190e+07 0.017404
Djibouti 2011 8.659370e+05 0.017378
Panama 2013 3.838462e+06 0.017367
Kiribati 2002 8.734300e+04 0.017296
Cambodia 2003 1.285312e+07 0.017285
Djibouti 2014 9.121640e+05 0.017259
Rwanda 2003 8.683460e+05 0.017245
Botswana 2009 1.979882e+06 0.017228
Panama 2006 3.391950e+05 0.017183
Algeria 2009 3.546576e+06 0.017168
Portugal 2004 1.483861e+06 0.017165
Slovenia 2011 2.528430e+05 0.017137
Maldives 2002 2.970000e+02 0.017123
Uzbekistan 2008 2.732800e+04 0.017121
Australia 2013 2.311735e+07 0.017120
Slovenia 2012 2.571590e+05 0.017070
Spain 2006 4.439732e+07 0.017047
Spain 2005 4.365316e+07 0.017037
Ireland 2002 3.931947e+06 0.016994
Cyprus 2003 9.935630e+05 0.016988
Ecuador 2009 1.469128e+07 0.016869
Nepal 2002 2.456634e+07 0.016744
Djibouti 2015 9.274140e+05 0.016718
Botswana 2008 1.946351e+06 0.016682
Mexico 2009 1.155523e+07 0.016632
Djibouti 2003 7.586150e+05 0.016609
Greece 2005 1.987314e+06 0.016456
Cambodia 2012 1.477687e+07 0.016438
Ireland 2003 3.996521e+06 0.016423
Latvia 2009 2.141669e+06 -0.016375
Haiti 2002 8.834733e+06 0.016355
Turkey 2013 7.578733e+07 0.016327
Luxembourg 2004 4.589500e+04 0.016208
Spain 2008 4.595416e+06 0.016082
Luxembourg 2006 4.726370e+05 0.016078
Haiti 2003 8.976552e+06 0.016052
Botswana 2007 1.914414e+06 0.016015
India 2005 1.144119e+09 0.015969
Ecuador 2012 1.541967e+07 0.015965
Bhutan 2013 7.649610e+05 0.015929
Philippines 2007 8.929349e+06 0.015920
Haiti 2004 9.119178e+06 0.015889
Dominican Republic 2001 8.697126e+06 0.015708
Ecuador 2013 1.566155e+07 0.015687
Seychelles 2014 9.135900e+04 0.015676
South Africa 2014 5.414673e+07 0.015658
India 2006 1.161978e+09 0.015609
Luxembourg 2007 4.799930e+05 0.015564
Dominican Republic 2002 8.832285e+06 0.015541
Botswana 2006 1.884238e+06 0.015295
Mongolia 2009 2.668289e+06 0.015280
India 2007 1.179681e+09 0.015236
Cyprus 2015 1.169850e+05 0.015151
Paraguay 2006 5.882796e+06 0.015064
Bhutan 2014 7.764480e+05 0.015016
Cambodia 2007 1.367669e+07 0.015006
Morocco 2015 3.483322e+06 0.014989
Uzbekistan 2012 2.977450e+05 0.014830
Namibia 2002 1.962147e+06 0.014766
Cabo Verde 2005 4.745670e+05 0.014761
Kazakhstan 2015 1.754413e+07 0.014743
Morocco 2013 3.382477e+07 0.014729
Algeria 2006 3.377792e+07 0.014704
Slovenia 2008 2.213160e+05 0.014643
Bangladesh 2005 1.434311e+07 0.014544
Dominican Republic 2006 9.371338e+06 0.014481
Morocco 2012 3.333379e+07 0.014455
Canada 2002 3.136200e+04 -0.014362
India 2009 1.214271e+08 0.014304
Botswana 2002 1.779953e+06 0.014256
Paraguay 2007 5.966159e+06 0.014171
Ukraine 2010 4.587700e+04 -0.014098
Bhutan 2015 7.873860e+05 0.014087
Indonesia 2002 2.175859e+06 0.014078
Colombia 2003 4.215215e+07 0.013943
Turkey 2004 6.778550e+05 0.013868
Turkey 2010 7.232691e+07 0.013846
Indonesia 2006 2.298382e+07 0.013786
South Africa 2008 4.955757e+07 0.013782
Algeria 2003 3.243514e+06 0.013742
Finland 2002 5.259800e+04 0.013683
Indonesia 2008 2.361593e+08 0.013606
Portugal 2007 1.542964e+06 0.013582
Australia 2001 1.941300e+04 0.013575
Paraguay 2012 6.379219e+06 0.013575
Colombia 2004 4.272416e+07 0.013570
Lesotho 2015 2.174645e+06 0.013450
Lesotho 2014 2.145785e+06 0.013424
Peru 2001 2.626136e+07 0.013370
Dominican Republic 2010 9.897985e+06 0.013332
Argentina 2003 3.839379e+06 0.013313
Albania 2011 2.951950e+05 0.013298
Paraguay 2015 6.639119e+06 0.013206
Honduras 2004 7.241530e+05 -0.013175
India 2012 1.263659e+08 0.013167
Peru 2011 2.975999e+07 0.013153
Colombia 2005 4.328563e+07 0.013142
Mongolia 2007 2.591670e+05 0.013115
Nicaragua 2008 5.594560e+05 0.013111
Georgia 2010 3.926000e+03 -0.013072
Georgia 2011 3.875000e+03 -0.012990
Georgia 2014 3.727000e+03 -0.012977
Guyana 2006 7.496100e+04 -0.012970
Ireland 2015 4.676835e+06 0.012910
Georgia 2012 3.825000e+03 -0.012903
Nicaragua 2007 5.522160e+05 0.012848
Latvia 2001 2.337170e+05 -0.012832
Georgia 2013 3.776000e+03 -0.012810
Switzerland 2008 7.647675e+06 0.012787
Colombia 2006 4.383572e+07 0.012708
Norway 2009 4.828726e+06 0.012691
Azerbaijan 2014 9.535790e+05 0.012635
Switzerland 2009 7.743831e+06 0.012573
Nicaragua 2010 5.737723e+06 0.012555
Norway 2010 4.889252e+06 0.012535
Peru 2004 2.727319e+07 0.012453
Morocco 2009 3.198990e+07 0.012439
Namibia 2003 1.986535e+06 0.012429
Australia 2003 1.989540e+05 0.012416
Brazil 2004 1.847385e+08 0.012365
Uzbekistan 2002 2.527185e+06 0.012314
Cabo Verde 2015 5.329130e+05 0.012302
Colombia 2007 4.437457e+07 0.012292
South Africa 2003 4.641819e+07 0.012271
Peru 2007 2.829272e+07 0.012264
Algeria 2004 3.283196e+06 0.012234
Canada 2012 3.475545e+06 0.012016
Uganda 2012 3.636796e+06 0.012007
South Africa 2002 4.585548e+07 0.011973
Chile 2001 1.544497e+07 0.011939
Suriname 2002 4.834400e+04 0.011931
Colombia 2011 4.646646e+06 0.011923
Germany 2011 8.274983e+06 0.011897
Morocco 2008 3.159686e+07 0.011880
Indonesia 2015 2.581621e+08 0.011880
Mongolia 2005 2.526446e+06 0.011861
Nepal 2006 2.594618e+06 0.011828
Nepal 2012 2.764992e+07 0.011812
Brazil 2005 1.869174e+08 0.011795
India 2013 1.278562e+08 0.011794
Zimbabwe 2001 1.236616e+07 0.011775
Nepal 2015 2.865628e+07 0.011759
Uzbekistan 2003 2.556765e+06 0.011705
Uzbekistan 2004 2.586435e+06 0.011605
Chile 2002 1.562364e+07 0.011568
Morocco 2001 2.918183e+07 0.011515
Canada 2009 3.362857e+07 0.011514
Switzerland 2015 8.282396e+06 0.011448
Zimbabwe 2004 1.277751e+07 0.011367
Brazil 2007 1.912664e+07 0.011327
Morocco 2002 2.951237e+07 0.011327
Suriname 2001 4.777400e+04 0.011325
Chile 2003 1.579954e+07 0.011259
Seychelles 2003 8.278100e+04 -0.011251
Morocco 2003 2.984394e+07 0.011235
Iceland 2014 3.273860e+05 0.011187
Mongolia 2004 2.496832e+06 0.011155
Argentina 2002 3.788937e+06 0.011149
Canada 2014 3.554456e+07 0.011068
Azerbaijan 2006 8.484550e+05 0.011046
Lithuania 2009 3.162916e+06 -0.011042
Chile 2004 1.597378e+07 0.011028
Latvia 2002 2.311730e+05 -0.010885
Canada 2008 3.324577e+07 0.010881
Slovenia 2013 2.599530e+05 0.010865
Cabo Verde 2008 4.917230e+05 0.010865
Latvia 2004 2.263122e+06 -0.010854
Argentina 2005 3.914549e+07 0.010762
Latvia 2005 2.238799e+06 -0.010748
Austria 2015 8.633169e+06 0.010723
Switzerland 2012 7.996861e+06 0.010675
Cabo Verde 2009 4.969630e+05 0.010656
Malta 2015 4.318740e+05 0.010553
Chile 2007 1.649169e+07 0.010533
Luxembourg 2002 4.461750e+05 0.010532
Argentina 2011 4.165688e+07 0.010503
Mongolia 2003 2.469286e+06 0.010487
Argentina 2007 3.997224e+06 0.010449
Argentina 2014 4.298152e+07 0.010381
Chile 2008 1.666194e+07 0.010324
Lithuania 2008 3.198231e+06 -0.010232
Ireland 2009 4.535375e+06 0.010208
Paraguay 2004 5.737400e+04 0.010195
Argentina 2015 4.341776e+07 0.010150
Canada 2004 3.199500e+04 0.010071
Lithuania 2013 2.957689e+06 -0.010069
Chile 2009 1.682944e+07 0.010053
Canada 2003 3.167600e+04 0.010012
Ukraine 2001 4.868386e+07 -0.010005
Guinea 2004 9.492290e+05 0.009981
Nepal 2008 2.647586e+07 0.009957
Lithuania 2002 3.443670e+05 -0.009922
Canada 2005 3.231200e+04 0.009908
Mongolia 2002 2.443659e+06 0.009870
Portugal 2008 1.558177e+06 0.009860
Colombia 2013 4.734298e+07 0.009844
Nepal 2005 2.564287e+06 0.009781
Brazil 2010 1.967963e+08 0.009750
Dominican Republic 2008 9.636520e+05 0.009744
Chile 2010 1.699335e+07 0.009740
Brazil 2011 1.986867e+08 0.009606
China 2012 1.356950e+05 0.009538
Colombia 2014 4.779191e+07 0.009483
Malta 2014 4.273640e+05 0.009424
Chile 2011 1.715336e+07 0.009416
Malta 2013 4.233740e+05 0.009343
Mongolia 2001 2.419776e+06 0.009318
South Africa 2001 4.531294e+07 0.009267
Myanmar 2014 5.192418e+07 0.009252
Lesotho 2001 1.885955e+06 0.009234
Fiji 2011 8.678600e+04 0.009198
Denmark 2004 5.445230e+05 0.009172
Colombia 2015 4.822870e+07 0.009139
Latvia 2006 2.218357e+06 -0.009131
Switzerland 2007 7.551117e+06 0.008977
Iceland 2002 2.875230e+05 0.008966
Brazil 2008 1.929793e+07 0.008956
Fiji 2007 8.348120e+05 0.008945
Suriname 2013 5.425400e+04 0.008870
Chile 2014 1.761380e+07 0.008636
Sweden 2010 9.378126e+06 0.008562
Honduras 2015 8.968290e+05 0.008561
Lithuania 2014 2.932367e+06 -0.008561
Sweden 2009 9.298515e+06 0.008555
Lesotho 2007 1.982287e+06 0.008458
Chile 2015 1.776268e+07 0.008453
Trinidad and Tobago 2004 1.295350e+05 0.008431
Lesotho 2006 1.965662e+06 0.008268
Belarus 2007 9.569530e+05 -0.008261
Lesotho 2005 1.949543e+06 0.008179
Latvia 2015 1.977527e+06 -0.008153
Samoa 2012 1.891940e+05 0.008147
Mauritius 2001 1.196287e+06 0.007932
Bulgaria 2003 7.775327e+06 -0.007890
Italy 2015 6.735820e+05 -0.007854
Malta 2012 4.194550e+05 0.007656
Tunisia 2003 9.939678e+06 0.007639
Botswana 2004 1.829330e+05 -0.007627
Sri Lanka 2009 1.996800e+04 0.007620
Sri Lanka 2001 1.879700e+04 0.007612
Switzerland 2002 7.284753e+06 0.007593
Sweden 2011 9.449213e+06 0.007580
Sri Lanka 2008 1.981700e+04 0.007576
Sri Lanka 2002 1.893900e+04 0.007554
Sri Lanka 2005 1.937300e+04 0.007541
Belarus 2010 9.495830e+05 -0.007507
Myanmar 2006 4.884647e+07 0.007505
Sweden 2012 9.519374e+06 0.007425
Austria 2014 8.541575e+06 0.007335
Poland 2010 3.842794e+06 0.007242
Luxembourg 2013 5.433600e+04 0.007229
Georgia 2001 4.386400e+04 -0.007220
Ukraine 2008 4.625820e+05 -0.007196
Denmark 2015 5.683483e+06 0.007089
Samoa 2009 1.848260e+05 0.007083
Fiji 2013 8.797150e+05 0.007004
France 2006 6.362138e+07 0.006996
Bulgaria 2008 7.492561e+06 -0.006995
Belarus 2003 9.796749e+06 -0.006974
Iceland 2003 2.895210e+05 0.006949
Austria 2005 8.227829e+06 0.006836
Fiji 2006 8.274110e+05 0.006807
Samoa 2008 1.835260e+05 0.006802
Albania 2009 2.927519e+06 -0.006716
Guyana 2015 7.685140e+05 0.006708
Myanmar 2007 4.917159e+07 0.006656
Estonia 2008 1.337900e+04 -0.006608
Malta 2003 3.985820e+05 0.006599
Bulgaria 2010 7.395599e+06 -0.006561
Guyana 2013 7.588100e+04 0.006499
Malawi 2015 1.757367e+06 -0.006485
Portugal 2009 1.568247e+06 0.006463
Switzerland 2005 7.437115e+06 0.006427
Bulgaria 2009 7.444443e+06 -0.006422
Samoa 2005 1.799290e+05 0.006421
Bulgaria 2011 7.348328e+06 -0.006392
Bulgaria 2015 7.177991e+06 -0.006360
Estonia 2001 1.388115e+06 -0.006349
Belarus 2002 9.865548e+06 -0.006345
Jamaica 2003 2.712511e+06 0.006331
Ukraine 2005 4.715150e+05 -0.006324
Samoa 2004 1.787810e+05 0.006298
Switzerland 2006 7.483934e+06 0.006295
Myanmar 2008 4.947975e+07 0.006267
Austria 2004 8.171966e+06 0.006223
Central African Republic 2011 4.476153e+06 0.006211
Romania 2005 2.131968e+07 -0.006156
Samoa 2003 1.776620e+05 0.006116
Tonga 2003 9.978900e+04 0.006100
Jamaica 2004 2.728777e+06 0.005997
Denmark 2008 5.493621e+06 0.005893
Norway 2003 4.564855e+06 0.005883
Austria 2013 8.479375e+06 0.005858
Jamaica 2005 2.744673e+06 0.005825
Tonga 2002 9.918400e+04 0.005811
Samoa 2002 1.765820e+05 0.005787
Romania 2004 2.145175e+07 -0.005682
Bulgaria 2014 7.223938e+06 -0.005668
Thailand 2007 6.619562e+07 0.005643
Norway 2002 4.538159e+06 0.005407
Poland 2011 3.863255e+06 0.005325
Mauritius 2015 1.262650e+05 -0.005270
France 2013 6.599857e+06 0.005160
China 2008 1.324655e+06 0.005137
Trinidad and Tobago 2012 1.341588e+06 0.005094
China 2015 1.371220e+05 0.005094
Denmark 2014 5.643475e+06 0.005083
China 2014 1.364270e+05 0.005076
El Salvador 2015 6.312478e+06 0.004981
Trinidad and Tobago 2013 1.348248e+06 0.004964
Austria 2006 8.268641e+06 0.004960
Italy 2005 5.796948e+07 0.004926
Belgium 2013 1.118282e+07 0.004904
China 2010 1.337750e+05 0.004875
Serbia 2012 7.199770e+05 -0.004868
Finland 2009 5.338871e+06 0.004794
Finland 2012 5.413971e+06 0.004769
China 2011 1.344130e+05 0.004769
Trinidad and Tobago 2009 1.321618e+06 0.004748
Mauritius 2006 1.233996e+06 0.004675
Finland 2011 5.388272e+06 0.004646
Seychelles 2005 8.285800e+04 0.004644
Trinidad and Tobago 2014 1.354493e+06 0.004632
Finland 2013 5.438972e+06 0.004618
Spain 2010 4.657690e+07 0.004615
Finland 2010 5.363352e+06 0.004585
Austria 2012 8.429991e+06 0.004570
Bosnia and Herzegovina 2009 3.746561e+06 -0.004527
El Salvador 2010 6.164626e+06 0.004456
Denmark 2007 5.461438e+06 0.004445
Fiji 2005 8.218170e+05 0.004232
Denmark 2013 5.614932e+06 0.004178
Finland 2014 5.461512e+06 0.004144
Ireland 2014 4.617225e+06 0.004117
Croatia 2014 4.238389e+06 -0.004065
Serbia 2007 7.381579e+06 -0.004046
Sweden 2004 8.993531e+06 0.003941
Seychelles 2009 8.729800e+04 0.003933
Armenia 2010 2.877311e+06 -0.003903
Netherlands 2008 1.644559e+07 0.003901
Jamaica 2014 2.862870e+05 0.003857
Sweden 2003 8.958229e+06 0.003728
Seychelles 2004 8.247500e+04 -0.003697
Fiji 2001 8.142180e+05 0.003692
Thailand 2011 6.753130e+05 0.003604
Denmark 2001 5.358783e+06 0.003590
Estonia 2012 1.322696e+06 -0.003573
Spain 2011 4.674270e+07 0.003560
Estonia 2013 1.317997e+06 -0.003553
Uruguay 2015 3.431552e+06 0.003511
Central African Republic 2014 4.515392e+06 0.003498
Uruguay 2010 3.374415e+06 0.003467
Iceland 2009 3.184990e+05 0.003418
Greece 2003 1.928700e+04 0.003382
Belize 2007 2.984700e+04 0.003362
Uruguay 2011 3.385624e+06 0.003322
France 2008 6.437499e+06 0.003315
Finland 2015 5.479531e+06 0.003299
Uruguay 2012 3.396777e+06 0.003294
Denmark 2006 5.437272e+06 0.003292
Austria 2007 8.295487e+06 0.003247
Denmark 2002 5.375931e+06 0.003200
Austria 2008 8.321496e+06 0.003135
Estonia 2011 1.327439e+06 -0.003031
Italy 2006 5.814398e+07 0.003010
Hungary 2013 9.893820e+05 -0.003003
Montenegro 2012 6.261000e+03 -0.002867
Croatia 2013 4.255689e+06 -0.002781
Argentina 2001 3.747159e+06 -0.002739
Italy 2012 5.953972e+07 0.002699
Sweden 2001 8.895960e+05 0.002679
Mauritius 2009 1.247429e+06 0.002659
Austria 2009 8.343323e+06 0.002623
Estonia 2014 1.314545e+06 -0.002619
Ukraine 2011 4.576100e+04 -0.002529
Ireland 2013 4.598294e+06 0.002485
Russian Federation 2013 1.435691e+07 0.002460
Japan 2001 1.271490e+05 0.002412
Germany 2009 8.192370e+05 -0.002387
Albania 2014 2.889140e+05 -0.002341
Netherlands 2005 1.631987e+07 0.002339
Japan 2002 1.274450e+05 0.002328
Iceland 2011 3.191400e+04 0.002293
Estonia 2010 1.331475e+06 -0.002278
Ukraine 2013 4.548960e+05 -0.002274
Ireland 2012 4.586897e+06 0.002207
Mauritius 2013 1.258653e+06 0.002206
Armenia 2012 2.881922e+06 0.002205
Turkey 2005 6.793460e+05 0.002200
Lesotho 2003 1.918970e+05 -0.002158
Montenegro 2009 6.182940e+05 0.002148
Japan 2003 1.277180e+05 0.002142
Fiji 2004 8.183540e+05 0.002114
Guyana 2008 7.463140e+05 -0.002079
Pakistan 2008 1.636446e+07 0.001928
Montenegro 2010 6.194280e+05 0.001834
Fiji 2002 8.156910e+05 0.001809
Montenegro 2008 6.169690e+05 0.001776
Montenegro 2004 6.133530e+05 0.001774
Russian Federation 2012 1.432168e+07 0.001735
Italy 2011 5.937945e+07 0.001721
Kazakhstan 2001 1.485834e+07 -0.001699
Germany 2002 8.248850e+07 0.001683
Japan 2012 1.276290e+05 -0.001596
Belarus 2015 9.489616e+06 0.001594
Montenegro 2005 6.142610e+05 0.001480
Japan 2013 1.274450e+05 -0.001442
Germany 2007 8.226637e+07 -0.001336
Japan 2014 1.272760e+05 -0.001326
Mauritania 2001 2.797290e+05 0.001324
Japan 2009 1.284700e+04 -0.001244
Guyana 2010 7.465560e+05 0.001157
Kazakhstan 2005 1.514729e+06 0.001153
Fiji 2003 8.166280e+05 0.001149
Germany 2006 8.237645e+07 -0.001127
Australia 2011 2.234240e+05 0.001116
Croatia 2009 4.429780e+05 -0.001082
Japan 2015 1.271410e+05 -0.001061
Belarus 2012 9.464495e+06 -0.000916
Belarus 2014 9.474511e+06 0.000899
Montenegro 2014 6.218100e+04 0.000869
Guyana 2009 7.456930e+05 -0.000832
Jamaica 2013 2.851870e+05 0.000684
Croatia 2005 4.442000e+03 0.000676
Uruguay 2003 3.325637e+06 -0.000642
Japan 2006 1.278540e+05 0.000634
Poland 2006 3.814127e+07 -0.000634
Armenia 2011 2.875581e+06 -0.000601
Germany 2003 8.253418e+07 0.000554
Bosnia and Herzegovina 2004 3.781287e+06 0.000540
Guyana 2002 7.518840e+05 -0.000504
Russian Federation 2010 1.428494e+08 0.000449
Poland 2005 3.816544e+07 -0.000439
Poland 2002 3.823364e+06 -0.000395
Japan 2004 1.277610e+05 0.000337
China 2013 1.357380e+05 0.000317
Guyana 2004 7.516520e+05 -0.000273
Belarus 2013 9.465997e+06 0.000159
Tunisia 2014 1.114398e+06 -0.000144
Suriname 2010 5.261300e+04 -0.000114
Japan 2005 1.277730e+05 0.000094
Kazakhstan 2002 1.485895e+07 0.000041
Guyana 2003 7.518570e+05 -0.000036
Poland 2012 3.863164e+06 -0.000024
Croatia 2002 4.440000e+02 0.000000
Croatia 2003 4.440000e+02 0.000000
df = dataframe.sort_values(["Country","Year"]).copy()
df["Population"] = pd.to_numeric(df["Population"])
df.loc[df["Population"] <= 0, "Population"] = np.nan
prev = df.groupby("Country")["Population"].shift(1)
ratio = df["Population"] / prev
spike = df["Population"].notna() & prev.notna() & ((ratio < 0.7) | (ratio > 1.3))
bad_countries = df.loc[spike, "Country"].unique().tolist()
print("Broj država sa spike-ovima:", len(bad_countries))
print("Primer:", bad_countries[:30])
Broj država sa spike-ovima: 143 Primer: ['Afghanistan', 'Albania', 'Algeria', 'Angola', 'Argentina', 'Armenia', 'Australia', 'Austria', 'Azerbaijan', 'Bangladesh', 'Belarus', 'Belgium', 'Belize', 'Benin', 'Bhutan', 'Bosnia and Herzegovina', 'Botswana', 'Brazil', 'Bulgaria', 'Burkina Faso', 'Burundi', 'Cabo Verde', 'Cambodia', 'Cameroon', 'Canada', 'Central African Republic', 'Chad', 'Chile', 'China', 'Colombia']
dataframe.loc[dataframe["Country"].isin(bad_countries), "Population"] = np.nan
pop = pd.to_numeric(dataframe["Population"])
missing_count = pop.isna().sum()
missing_percent = pop.isna().mean() * 100
print("Nedostajuci redovi:", missing_count)
print("Nedostajuci % feature-a Population", round(missing_percent,2), "%")
Nedostajuci redovi: 2936 Nedostajuci % feature-a Population 99.93 %
url = "https://api.worldbank.org/v2/country/all/indicator/SP.POP.TOTL?date=2000:2015&format=json&per_page=20000"
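# Fail fast on network problems or HTTP errors instead of parsing an error page
r = requests.get(url, timeout=30)
r.raise_for_status()
data = r.json()[1]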
pop = pd.DataFrame([{
"Country": d["country"]["value"],
"Year": int(d["date"]),
"PopulationWB": d["value"]
} for d in data if d["value"] is not None])
print(pop.head())
Country Year PopulationWB 0 Africa Eastern and Southern 2015 607123269 1 Africa Eastern and Southern 2014 590968990 2 Africa Eastern and Southern 2013 575202699 3 Africa Eastern and Southern 2012 559609961 4 Africa Eastern and Southern 2011 544737983
your = set(dataframe["Country"].unique())
worldBankData = set(pop["Country"].unique())
print(your - worldBankData)
{'Bahamas', 'Republic of Korea', "Côte d'Ivoire", 'Gambia', 'Saint Kitts and Nevis', 'Iran (Islamic Republic of)', 'Saint Lucia', 'Slovakia', 'Democratic Republic of the Congo', 'Egypt', 'The former Yugoslav republic of Macedonia', 'Congo', 'Micronesia (Federated States of)', 'Niue', "Democratic People's Republic of Korea", 'United States of America', "Lao People's Democratic Republic", 'Turkey', 'Venezuela (Bolivarian Republic of)', 'United Kingdom of Great Britain and Northern Ireland', 'Yemen', 'Kyrgyzstan', 'United Republic of Tanzania', 'Republic of Moldova', 'Somalia', 'Swaziland', 'Bolivia (Plurinational State of)', 'Cook Islands', 'Saint Vincent and the Grenadines'}
name_map = {
"Bahamas": "Bahamas, The",
"Bolivia (Plurinational State of)": "Bolivia",
"Côte d'Ivoire": "Cote d'Ivoire",
"Congo": "Congo, Rep.",
"Democratic Republic of the Congo": "Congo, Dem. Rep.",
"Democratic People's Republic of Korea": "Korea, Dem. People's Rep.",
"Egypt": "Egypt, Arab Rep.",
"Iran (Islamic Republic of)": "Iran, Islamic Rep.",
"Gambia": "Gambia, The",
"Kyrgyzstan": "Kyrgyz Republic",
"Lao People's Democratic Republic": "Lao PDR",
"United Republic of Tanzania": "Tanzania",
"Micronesia (Federated States of)": "Micronesia, Fed. Sts.",
"Republic of Korea": "Korea, Rep.",
"Republic of Moldova": "Moldova",
"Saint Vincent and the Grenadines": "St. Vincent and the Grenadines",
"Saint Lucia": "St. Lucia",
"Slovakia": "Slovak Republic",
"Venezuela (Bolivarian Republic of)": "Venezuela, RB",
"United States of America": "United States",
"The former Yugoslav republic of Macedonia": "North Macedonia",
"United Kingdom of Great Britain and Northern Ireland": "United Kingdom",
"Yemen": "Yemen, Rep.",
"Saint Kitts and Nevis": "St. Kitts and Nevis",
"Swaziland": "Eswatini",
"Turkey": "Turkiye"
}
df = dataframe.copy()
df["Year"] = pd.to_numeric(df["Year"]).astype(int)
df["Country_wb"] = df["Country"].str.strip().replace(name_map)
copyDataframe = pop.copy()
copyDataframe["Year"] = pd.to_numeric(copyDataframe["Year"]).astype(int)
merged = df.merge(
copyDataframe.rename(columns={"Country": "Country_wb"}),
on=["Country_wb", "Year"],
how="left"
)
merged["Population"] = pd.to_numeric(merged["Population"], errors="coerce")
merged["Population"] = merged["Population"].fillna(merged["PopulationWB"])
merged.loc[merged["Population"] < 10_000, "Population"] = merged["PopulationWB"]
merged.drop(columns=["PopulationWB"], inplace=True)
dataframe = merged
print("Remaining missing Population:", dataframe["Population"].isna().sum())
Remaining missing Population: 18
print(pop[pop["Country"]=="Somalia"][["Country","Year","PopulationWB"]].sort_values("Year").to_string(index=False))
Empty DataFrame Columns: [Country, Year, PopulationWB] Index: []
all_missing = (
dataframe.groupby("Country")["Population"]
.apply(lambda s: s.isna().mean())
.loc[lambda x: x == 1.0]
.index
)
print("Countries with 100% missing Population:", len(all_missing))
print(all_missing.tolist())
Countries with 100% missing Population: 3 ['Cook Islands', 'Niue', 'Somalia']
dataframe = dataframe[~dataframe["Country"].isin(all_missing)].copy()
We checked "Population" per country across years and looked for extreme jumps relative to the previous year (ratio < 0.7 or > 1.3). Changes of that size are not realistic for population (a country cannot grow or shrink by 30% in a single year except in very exceptional circumstances), so they are a strong signal that the data in this column is corrupted.
We therefore decided to set Population to NaN for all years of every country that shows such spikes and to fill it from a more reliable source, the World Bank dataset. This seems much cleaner than trying to guess the correct scale of the data or filling population with the mean/median of other countries.
After the merge with the World Bank population, most values were filled successfully. For three countries (Cook Islands, Niue, Somalia) no data was available in that source for the requested period, so we dropped those countries.
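The spike rule described above can be illustrated on a small example. This is a minimal sketch with made-up numbers (the `toy` frame and its values are hypothetical, not taken from the dataset); it only shows how the year-over-year ratio flags a value recorded at the wrong scale.

```python
import pandas as pd

# Hypothetical toy data: country "B" has one value recorded at the wrong scale in 2002.
toy = pd.DataFrame({
    "Country": ["A"] * 4 + ["B"] * 4,
    "Year": [2000, 2001, 2002, 2003] * 2,
    "Population": [1000.0, 1010.0, 1021.0, 1030.0,
                   5000.0, 5050.0, 510.0, 5150.0],
})

toy = toy.sort_values(["Country", "Year"])
prev = toy.groupby("Country")["Population"].shift(1)
ratio = toy["Population"] / prev

# Flag rows whose value changed by more than 30% relative to the previous year.
spike = prev.notna() & ((ratio < 0.7) | (ratio > 1.3))
bad_countries = sorted(toy.loc[spike, "Country"].unique())
print(bad_countries)
```

Both the drop into the bad value and the jump back out of it are flagged, which is why a single mis-scaled year is enough to mark the whole country.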
GDP
dataframe["GDP"] = pd.to_numeric(dataframe["GDP"])
dataframe.loc[dataframe["GDP"] < 0, "GDP"] = np.nan
dataframe["GDP"].describe()
count 2490.000000 mean 7483.158469 std 14270.169342 min 1.681350 25% 463.935626 50% 1766.947595 75% 5910.806335 max 119172.741800 Name: GDP, dtype: float64
The GDP distribution is strongly skewed. Most countries have relatively low GDP values, while a small number have very high ones, which shows in the large maximum (about 119,173) and the large standard deviation (about 14,270).
The median (about 1,767) is far below the mean (about 7,483), which also indicates that a few very rich countries pull the average upward.
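A quick way to see this kind of skew, and what a log transformation would do to it, is sketched below. The values are hypothetical and `np.log1p` is only one possible choice of transformation; nothing here is applied to the actual dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical GDP-like values: mostly small, one very large.
gdp = pd.Series([300.0, 450.0, 1800.0, 6000.0, 110000.0])

# A mean far above the median is a simple symptom of right skew.
print("mean:", gdp.mean(), "median:", gdp.median())

# log1p compresses the long right tail.
log_gdp = np.log1p(gdp)
print("skew raw:", gdp.skew(), "skew log1p:", log_gdp.skew())
```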
dataframe["GDP_diff"] = dataframe.groupby("Country")["GDP"].diff().abs()
dataframe.sort_values("GDP_diff", ascending=False)[["Country","Year","GDP","GDP_diff"]].head(20)
| | Country | Year | GDP | GDP_diff |
|---|---|---|---|---|
| 1539 | Luxembourg | 2014 | 119172.74180 | 117972.91950 |
| 1546 | Luxembourg | 2007 | 1618.49280 | 112675.35050 |
| 1545 | Luxembourg | 2008 | 114293.84330 | 101095.17400 |
| 1543 | Luxembourg | 2010 | 14965.36100 | 100796.21600 |
| 1542 | Luxembourg | 2011 | 115761.57700 | 99012.44100 |
| 1541 | Luxembourg | 2012 | 16749.13600 | 97002.71400 |
| 1547 | Luxembourg | 2006 | 89739.71170 | 88121.21890 |
| 1916 | Norway | 2009 | 817.77681 | 86828.97665 |
| 1915 | Norway | 2010 | 87646.75346 | 86071.76736 |
| 2076 | Qatar | 2010 | 736.22784 | 85212.51816 |
| 2079 | Qatar | 2007 | 675.61258 | 82291.75970 |
| 1548 | Luxembourg | 2005 | 8289.69641 | 81450.01529 |
| 2074 | Qatar | 2012 | 88564.82298 | 79729.94340 |
| 2073 | Qatar | 2013 | 8834.87958 | 78017.83232 |
| 2522 | Switzerland | 2014 | 85814.58857 | 76824.74617 |
| 1918 | Norway | 2007 | 85128.65759 | 75440.06149 |
| 1549 | Luxembourg | 2004 | 75716.35180 | 67426.65539 |
| 1179 | Iceland | 2006 | 5613.54115 | 62734.77702 |
| 1921 | Norway | 2004 | 5757.26916 | 61018.12524 |
| 2077 | Qatar | 2009 | 61478.23813 | 60742.01029 |
Here we compute the absolute difference in GDP between consecutive years for each country (GDP_diff) to see where the largest changes occur. The output shows extremely large jumps, especially for countries such as Luxembourg, Norway and Qatar.
gdp_suspicious = dataframe[(dataframe['Status'] == 'Developed') & (dataframe['GDP'] < 500)][['Country','Year','GDP']]
print(gdp_suspicious.to_string())
Country Year GDP 125 Australia 2002 281.817630 137 Austria 2006 443.993610 396 Bulgaria 2003 271.468240 397 Bulgaria 2002 287.534843 399 Bulgaria 2000 169.285860 1174 Iceland 2011 46.217000 1282 Italy 2015 349.147550 1289 Italy 2008 464.184650 1296 Italy 2001 24.819000 1297 Italy 2000 251.242600 1536 Lithuania 2001 353.147337 1851 New Zealand 2009 282.941930 2046 Poland 2008 141.446880 2048 Poland 2006 94.772600 2120 Romania 2014 12.277330 2123 Romania 2011 92.277825 2346 Slovenia 2014 242.672860 2426 Spain 2014 296.472250
The output shows that some developed countries have a very low GDP (e.g. Italy with about $25 in 2001, Romania with about $92 in 2011), which clearly makes no sense. This most likely points to an error in the units or scale of the data. Different sources or metrics (e.g. GDP per capita vs. total GDP) were probably mixed in some rows, so some values were entered or scaled incorrectly. We will treat these values as data errors.
countries = ["Luxembourg","Norway","Qatar","Belgium","New Zealand"]
tmp = dataframe[dataframe["Country"].isin(countries)].sort_values(["Country","Year"])
for country in countries:
    g = tmp[tmp["Country"] == country]
    plt.figure(figsize=(7,3))
    plt.plot(g["Year"], g["GDP"], marker="o")
    plt.title(country)
    plt.xlabel("Godina")
    plt.ylabel("GDP")
    plt.grid(True, alpha=0.3)
    plt.show()
These plots show abrupt drops and jumps in GDP from year to year (e.g. from ~80k to ~800 and back again). Such changes make no sense from a domain perspective, because GDP per capita usually changes gradually over time rather than by a factor of tens or hundreds within a single year.
It is especially suspicious that these changes appear in developed and wealthy countries such as Luxembourg, Norway and Qatar, where economic changes are usually relatively stable.
The plots visually confirm what we already saw in the GDP_diff table: the largest differences come from inconsistent or incorrectly scaled values in the dataset.
miss_year = dataframe.groupby("Year")["GDP"].apply(lambda x: x.isna().mean())
plt.figure(figsize=(7,3))
plt.plot(miss_year.index, miss_year*100, marker="o")
plt.title("GDP missing po godini (%)")
plt.xlabel("Godina")
plt.ylabel("Missing (%)")
plt.grid(True, alpha=0.3)
plt.show()
The share of missing GDP values is around 39 to 41% in the first years (2000 to 2004), while in some later years it is somewhat higher, around 46 to 55%. Still, there is no clear trend of older years having more missing data, because, for example, 2015 drops back to about 40%.
It therefore appears that missing GDP values are not primarily related to the year, but rather to the countries themselves or to the data source. In other words, some countries seem to systematically lack GDP data across multiple years.
miss_country = dataframe.groupby("Country")["GDP"].apply(lambda x: x.isna().mean()).sort_values(ascending=False)
top20 = miss_country.head(20).sort_values()
plt.figure(figsize=(8,6))
plt.barh(top20.index, top20.values * 100)
plt.title("Procenat nedostajucih vrednosti")
plt.xlabel("Missing (%)")
plt.ylabel("Drzave")
plt.grid(True, axis="x", alpha=0.3)
plt.show()
For some countries GDP is missing in 100% of rows across all years. In that case we have no known value for the country at all, so interpolation or imputation with the per-country median is not possible.
miss_status = dataframe.groupby("Status")["GDP"].apply(lambda s: s.isna().mean())
plt.figure()
plt.bar(miss_status.index.astype(str), miss_status.values*100)
plt.title("GDP missingness by Status (%)")
plt.xlabel("Status"); plt.ylabel("Missing %")
plt.show()
Here I look at the share of missing GDP values by country status. There is a difference, but it is not large: about 12% for developed and 16% for developing countries.
For that reason it does not seem that missing GDP depends directly on the country's status. Both groups have a similar share of missing values, so the problem more likely comes from the way GDP was collected in the dataset than from whether the country is developed or developing.
tab = pd.crosstab(dataframe["Status"], dataframe["GDP"])
print((tab.div(tab.sum(axis=1), axis=0)*100).round(2))
chi2, p_chi, dof, exp = stats.chi2_contingency(tab)
print("Chi-square p-value:", p_chi)
GDP 1.681350 3.685949 4.613575 5.668726 \ Status Developed 0.00 0.00 0.00 0.00 Developing 0.04 0.04 0.04 0.04 GDP 8.376432 11.147277 11.336780 11.553196 \ Status Developed 0.00 0.00 0.00 0.00 Developing 0.04 0.04 0.04 0.04 GDP 11.631377 12.178928 ... 85948.746000 86852.711900 \ Status ... Developed 0.00 0.00 ... 0.00 0.00 Developing 0.04 0.04 ... 0.04 0.04 GDP 87646.753460 87998.444680 88564.822980 89739.711700 \ Status Developed 0.2 0.2 0.00 0.2 Developing 0.0 0.0 0.04 0.0 GDP 113751.850000 114293.843300 115761.577000 119172.741800 Status Developed 0.2 0.2 0.2 0.2 Developing 0.0 0.0 0.0 0.0 [2 rows x 2902 columns] Chi-square p-value: 0.3112029976471268
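We use a chi-square test to check whether missing GDP values are associated with Status. The result (p = 0.066) indicates that there is not strong enough evidence that missingness depends on status, even though Developing has a slightly higher share of missing GDP. Note that the crosstab in the code above is built on the raw GDP values (2902 columns) rather than on a missing-value indicator, so the printed p-value of 0.311 does not measure missingness; the test should be run on `dataframe["GDP"].isna()`.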
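For a missingness test, the contingency table should have one column per missing/present state, not one per distinct GDP value. The sketch below shows the shape of such a test on a hypothetical toy frame (the counts are invented); it uses `scipy.stats.chi2_contingency`, the same function as the notebook.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical toy frame: 10 Developed rows (2 missing GDP), 20 Developing rows (6 missing GDP).
status = ["Developed"] * 10 + ["Developing"] * 20
gdp = [np.nan] * 2 + [40000.0] * 8 + [np.nan] * 6 + [1500.0] * 14
toy = pd.DataFrame({"Status": status, "GDP": gdp})

# 2x2 table: Status vs. whether GDP is missing (columns False / True).
tab = pd.crosstab(toy["Status"], toy["GDP"].isna())
chi2, p_value, dof, expected = stats.chi2_contingency(tab)

print(tab)
print("dof:", dof, "p-value:", round(p_value, 4))
```

With a 2x2 table the test has one degree of freedom; a table with thousands of columns, or with zero degrees of freedom, is a sign that the wrong variable was cross-tabulated or that no missing values were left when the cell was run.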
y = "Life expectancy"
m = dataframe["GDP"].isna()
a = dataframe.loc[m, y].dropna()
b = dataframe.loc[~m, y].dropna()
print("p =", stats.mannwhitneyu(a, b, alternative="two-sided").pvalue)
p = 0.010427159959234789
Here I use the Mann-Whitney test to check whether Life expectancy differs between rows where GDP is missing and rows where it is present. The test compares the distribution of values between these two groups.
The resulting p-value is p = 0.010, which is below 0.05, so we can say there is a statistically significant difference between the groups. This means Life expectancy is not distributed the same way in rows where GDP is missing as in rows where it is present.
It therefore appears that the missingness of GDP is not completely random, but is probably related to characteristics of the countries.
missing_gdp = dataframe["GDP"].isna()
numeric_cols = dataframe.select_dtypes(include=[np.number]).columns
numeric_cols = [col for col in numeric_cols if col != "GDP"]
rows = []
for col in numeric_cols:
    group_missing = dataframe.loc[missing_gdp, col].dropna()
    group_present = dataframe.loc[~missing_gdp, col].dropna()
    if len(group_missing) < 10 or len(group_present) < 10:
        continue
    p = stats.mannwhitneyu(group_missing, group_present, alternative="two-sided").pvalue
    rows.append({
        "feature": col,
        "pvalue": p,
        "mean_GDP_missing": group_missing.mean(),
        "mean_GDP_present": group_present.mean(),
        "n_missing": len(group_missing),
        "n_present": len(group_present),
    })
result = pd.DataFrame(rows).sort_values("pvalue")
result.head(20)
| | feature | pvalue | mean_GDP_missing | mean_GDP_present | n_missing | n_present |
|---|---|---|---|---|---|---|
| 5 | percentage expenditure | 3.119805e-220 | 0.000000e+00 | 8.710772e+02 | 448 | 2490 |
| 2 | Adult Mortality | 6.881842e-06 | 1.799661e+02 | 1.620922e+02 | 443 | 2485 |
| 18 | Schooling | 1.356547e-05 | 1.122500e+01 | 1.208170e+01 | 288 | 2487 |
| 3 | infant deaths | 1.629540e-05 | 2.492188e+01 | 3.127229e+01 | 448 | 2490 |
| 17 | Income composition of resources | 4.227555e-05 | 5.952160e-01 | 6.312870e-01 | 287 | 2484 |
| 9 | under-five deaths | 6.711992e-05 | 3.416295e+01 | 4.345221e+01 | 448 | 2490 |
| 13 | HIV/AIDS | 8.172494e-03 | 9.439732e-01 | 1.885703e+00 | 448 | 2490 |
| 1 | Life expectancy | 1.042716e-02 | 6.840745e+01 | 6.937066e+01 | 443 | 2485 |
| 6 | Hepatitis B | 3.064502e-02 | 8.210243e+01 | 8.072642e+01 | 371 | 2014 |
| 15 | thinness 10-19 years | 1.784044e-01 | 4.880137e+00 | 4.832522e+00 | 438 | 2466 |
| 10 | Polio | 1.803821e-01 | 8.272500e+01 | 8.251916e+01 | 440 | 2479 |
| 11 | Total expenditure | 2.470868e-01 | 6.300974e+00 | 5.879074e+00 | 380 | 2332 |
| 14 | Population | 2.576610e-01 | 7.700579e+06 | 1.280247e+07 | 22 | 2264 |
| 7 | Measles | 2.904917e-01 | 2.771346e+03 | 2.356305e+03 | 448 | 2490 |
| 16 | thinness 5-9 years | 4.255603e-01 | 4.842694e+00 | 4.875223e+00 | 438 | 2466 |
| 8 | BMI | 6.791662e-01 | 3.772717e+01 | 3.842676e+01 | 438 | 2466 |
| 4 | Alcohol | 6.876185e-01 | 4.744636e+00 | 4.577813e+00 | 412 | 2332 |
| 12 | Diphtheria | 7.487622e-01 | 8.136818e+01 | 8.249375e+01 | 440 | 2479 |
| 0 | Year | 7.919964e-01 | 2.007571e+03 | 2.007509e+03 | 448 | 2490 |
When GDP is missing, countries on average show a "weaker" development profile. Adult Mortality is higher, while Schooling, Income composition of resources and Life expectancy are lower. The statistical tests show that these differences are significant.
This also makes sense from a domain perspective: GDP per capita is strongly related to a country's level of development. Countries with higher GDP usually have a better health system, longer schooling and higher life expectancy. It is therefore logical that rows without GDP data often look like countries with a lower level of development.
It is also visible that GDP is often missing together with Population (only 22 of the 448 rows with missing GDP have a Population value), probably because both come from the same economic/statistical sources. Life expectancy, on the other hand, is available more often, because it comes from health statistics (WHO).
For this reason Schooling, Income composition of resources and Life expectancy can help with predictive imputation of GDP, since they are genuinely related to a country's economic development.
pairs = [
("Schooling", "GDP"),
("Income composition of resources", "GDP"),
("Life expectancy", "GDP")
]
for x,y in pairs:
    tmp = dataframe[[x,y]].dropna()
    plt.figure(figsize=(5,4))
    plt.scatter(tmp[x], tmp[y], alpha=0.25, s=10)
    plt.title(f"{y} vs {x}")
    plt.xlabel(x)
    plt.ylabel(y)
    plt.show()
The scatter plots show a clear trend: as Schooling, Income composition of resources and Life expectancy increase, GDP increases on average as well. The highest GDP values mostly appear at higher values of these indicators (e.g. life expectancy around 75 to 85, schooling around 12 to 18 and income composition around 0.7 to 0.9).
The plots also show that the GDP distribution is very skewed: many points sit at low values, while a small number reach very high values. This makes the plot look compressed at the bottom, with a few extreme outliers, but the positive relationship between these variables and GDP is still clearly visible.
For this reason it makes sense to impute GDP predictively, using other development indicators, instead of filling it randomly or simply with the median across countries.
cols = ["GDP","Schooling","Income composition of resources","Life expectancy","Adult Mortality","Total expenditure","Alcohol"]
print(dataframe[cols].corr(numeric_only=True)["GDP"].sort_values(ascending=False))
GDP 1.000000 Life expectancy 0.457943 Income composition of resources 0.447561 Schooling 0.440318 Alcohol 0.356166 Total expenditure 0.139819 Adult Mortality -0.300663 Name: GDP, dtype: float64
ct_gdp_status = pd.crosstab(dataframe['GDP'].isnull(), dataframe['Status'])
chi2_gdp, p_gdp_chi, dof_gdp, _ = chi2_contingency(ct_gdp_status)
print(f" Chi2={chi2_gdp:.2f}, dof={dof_gdp}, p={p_gdp_chi:.4f}")
Chi2=0.00, dof=0, p=1.0000
FAIL TO REJECT H0: GDP missing roughly equally across Status groups This suggests the gap is about WHICH country, not development status per se conflict/island nations are distributed across both categories
gdp = dataframe["GDP"].copy()
country_med = dataframe.groupby("Country")["GDP"].transform("median")
collapsed = gdp.notna() & country_med.notna() & (gdp < 0.2 * country_med)
dataframe.loc[collapsed, "GDP"] = np.nan
I used the per-country median as the reference value because it is more robust to outliers than the mean. This dataset already contains extremely wrong GDP values, so the mean would be "pulled" by those very large or very small numbers. The median better represents the typical GDP level for the country.
As a heuristic, I therefore treat values below 20% of the country's median as likely errors rather than real economic changes.
df = dataframe.sort_values(["Country","Year"]).copy()
prev = df.groupby("Country")["GDP"].shift(1)
next_ = df.groupby("Country")["GDP"].shift(-1)
bad_prev = prev.notna() & ((df["GDP"] / prev < 0.2) | (df["GDP"] / prev > 5))
bad_next = next_.notna() & ((df["GDP"] / next_ < 0.2) | (df["GDP"] / next_ > 5))
bad_jump = df["GDP"].notna() & (bad_prev | bad_next)
dataframe.loc[bad_jump, "GDP"] = np.nan
I use two rules: (1) "bad_jump" catches years where GDP changes abruptly (by more than a factor of 5) relative to neighboring years, which is a typical sign of an error. (2) The median rule catches values that are generally too low compared with the country's typical level, even when neighboring years are not available. The two rules are not the same thing, but they complement each other.
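The interaction of the two rules can be sketched on a single hypothetical series (the `toy` frame and the helper `jump_mask` are illustrative names, not part of the notebook). The sketch suggests why the order matters: a neighbor ratio cannot tell which of two adjacent values is the wrong one, so running the jump rule alone also flags the valid years next to a collapsed value.

```python
import numpy as np
import pandas as pd

# Hypothetical series for one country with a single collapsed value in 2002.
toy = pd.DataFrame({
    "Country": ["A"] * 6,
    "Year": [2000, 2001, 2002, 2003, 2004, 2005],
    "GDP": [40000.0, 41000.0, 420.0, 43000.0, 44000.0, 45000.0],
})

def jump_mask(frame):
    """Flag rows whose GDP differs from a neighboring year by more than a factor of 5."""
    grouped = frame.groupby("Country")["GDP"]
    r_prev = frame["GDP"] / grouped.shift(1)
    r_next = frame["GDP"] / grouped.shift(-1)
    # Comparisons with NaN evaluate to False, so missing neighbors are ignored.
    return (r_prev < 0.2) | (r_prev > 5) | (r_next < 0.2) | (r_next > 5)

# Rule 2 (median rule): values below 20% of the country's median.
country_med = toy.groupby("Country")["GDP"].transform("median")
collapsed = toy["GDP"] < 0.2 * country_med

# Jump rule on the raw data flags the bad year and both of its neighbors.
jump_alone = jump_mask(toy)

# Applying the median rule first removes the bad value, so the neighbors survive.
cleaned = toy.copy()
cleaned.loc[collapsed, "GDP"] = np.nan
jump_after = jump_mask(cleaned)

print(jump_alone.tolist())
print(jump_after.tolist())
```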
dataframe = dataframe.sort_values(["Country","Year"])
dataframe["GDP"] = dataframe.groupby("Country")["GDP"].transform(
lambda s: s.interpolate(limit_direction="both")
)
In countries where some GDP values exist across the years, GDP per capita usually changes gradually rather than abruptly. It therefore makes sense to use interpolation within the same country: it fills the missing years by following the trend between existing values.
A per-country mean or median would give the same value for all missing years of that country. That would lose the time trend, because GDP per capita usually rises or falls gradually over the years. For example, if a country has known GDP values for 2000, 2005 and 2010, the mean would insert the same value everywhere between them, which does not follow the real movement of the economy.
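The difference between the two strategies is easy to see on a tiny hypothetical series (values invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical country with a steadily growing GDP and two missing years.
toy = pd.DataFrame({
    "Country": ["A"] * 5,
    "Year": [2000, 2001, 2002, 2003, 2004],
    "GDP": [2000.0, np.nan, 3000.0, np.nan, 4000.0],
})

# Linear interpolation within the country follows the trend between known values.
interp = toy.groupby("Country")["GDP"].transform(
    lambda s: s.interpolate(limit_direction="both")
)

# A per-country median puts the same number into every gap.
median_fill = toy["GDP"].fillna(toy.groupby("Country")["GDP"].transform("median"))

print(interp.tolist())       # [2000.0, 2500.0, 3000.0, 3500.0, 4000.0]
print(median_fill.tolist())  # [2000.0, 3000.0, 3000.0, 3000.0, 4000.0]
```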
work = dataframe.copy()
work["Status_encoded"] = work["Status"].map({"Developing": 0, "Developed": 1})
knn_features = [
"GDP",
"Life expectancy",
"Schooling",
"Income composition of resources",
"Adult Mortality",
"Total expenditure",
"Alcohol",
"Status_encoded"
]
work[knn_features] = work[knn_features].apply(pd.to_numeric)
knn_imputer = KNNImputer(n_neighbors=5, weights="distance")
work[knn_features] = knn_imputer.fit_transform(work[knn_features])
dataframe["GDP"] = work["GDP"]
For the remaining gaps I imputed GDP predictively (KNN) instead of using the mean/median or interpolation. The mean/median would ignore differences in development between countries, and interpolation is not possible for countries in which whole blocks of GDP data are missing.
Since the scatter plots show a clear relationship between GDP and the development indicators (schooling, income composition, life expectancy), GDP can be reasonably estimated from similar countries with similar values of those indicators.
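KNN imputation is distance based, so features on larger numeric scales weigh more heavily in the neighbour search. One possible variation, not the notebook's own call, is to standardize the features before imputing and to invert the scaling afterwards. The sketch below assumes toy data and a reduced feature set; `toy`, the values, and the choice of `StandardScaler` are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical): two low-income-like and two high-income-like rows
# with known GDP, plus one row of each kind with GDP missing.
toy = pd.DataFrame({
    "GDP": [500.0, 600.0, np.nan, 40000.0, 42000.0, np.nan],
    "Schooling": [5.0, 5.5, 5.2, 16.0, 16.5, 16.2],
    "Adult Mortality": [300.0, 310.0, 305.0, 60.0, 65.0, 62.0],
})

# StandardScaler ignores NaNs when fitting and preserves them on transform.
scaler = StandardScaler()
scaled = scaler.fit_transform(toy)

# Impute in the standardized space, then map back to the original units.
imputer = KNNImputer(n_neighbors=2, weights="distance")
imputed_scaled = imputer.fit_transform(scaled)
imputed = pd.DataFrame(scaler.inverse_transform(imputed_scaled),
                       columns=toy.columns, index=toy.index)

print(imputed.round(1))
```

Each missing GDP value is estimated from the two most similar rows, so the low-schooling row receives a value between 500 and 600 and the high-schooling row a value between 40000 and 42000.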
dataframe = dataframe.drop(columns=["GDP_diff"], errors="ignore")
HEPATITIS B
dataframe['Hepatitis B'] = pd.to_numeric(dataframe['Hepatitis B'], errors='coerce')
col = 'Hepatitis B'
m = dataframe[col].isna()
m.mean(), m.sum()
(0.18493150684931506, 540)
hepb = dataframe['Hepatitis B'].dropna()
print(f" Range: min={hepb.min()}, max={hepb.max()}")
print(f" Values == 0: {(hepb == 0).sum()} rows")
print(f" Values < 5: {(hepb < 5).sum()} rows")
Range: min=1.0, max=99.0
Values == 0: 0 rows
Values < 5: 9 rows
Everything looks fine here; the sanity check passed.
dataframe.loc[~m, col].describe()
count    2380.000000
mean       80.974790
std        25.053021
min         1.000000
25%        77.000000
50%        92.000000
75%        97.000000
max        99.000000
Name: Hepatitis B, dtype: float64
col = "Hepatitis B"
dataframe[col] = pd.to_numeric(dataframe[col], errors="coerce")
missing_rate = dataframe[col].isna().mean()*100
print("Missing %:", round(missing_rate,2))
print(dataframe[col].describe())
print("min/max:", dataframe[col].min(), dataframe[col].max())
Missing %: 18.49
count    2380.000000
mean       80.974790
std        25.053021
min         1.000000
25%        77.000000
50%        92.000000
75%        97.000000
max        99.000000
Name: Hepatitis B, dtype: float64
min/max: 1.0 99.0
tmp = dataframe[["Hepatitis B","Polio"]].dropna()
plt.figure(figsize=(5,4))
plt.scatter(tmp["Polio"], tmp["Hepatitis B"], alpha=0.25, s=10)
plt.xlabel("Polio %"); plt.ylabel("Hepatitis B %")
plt.title("HepB vs Polio (not random if diagonal)")
plt.show()
tmp = dataframe[["Hepatitis B","Diphtheria"]].dropna()
plt.figure(figsize=(5,4))
plt.scatter(tmp["Diphtheria"], tmp["Hepatitis B"], alpha=0.25, s=10)
plt.xlabel("Diphtheria %"); plt.ylabel("Hepatitis B %")
plt.title("HepB vs Diphtheria (not random if diagonal)")
plt.show()
Country with strong immunization system → high Polio → high Diphtheria → high HepB
miss_by_year = dataframe.groupby("Year")[col].apply(lambda s: s.isna().mean()).sort_index()
plt.figure()
plt.plot(miss_by_year.index, miss_by_year.values*100)
plt.title("Hepatitis B missingness by year (%)")
plt.xlabel("Year"); plt.ylabel("Missing (%)")
plt.show()
The share of missing values is higher in the earlier years. A possible reason is weaker reporting or incomplete coverage in the source.
miss_by_status = dataframe.groupby("Status")[col].apply(lambda s: s.isna().mean())
plt.figure()
plt.bar(miss_by_status.index.astype(str), miss_by_status.values*100)
plt.title("Hepatitis B missingness by Status (%)")
plt.xlabel("Status"); plt.ylabel("Missing (%)")
plt.show()
There are more missing values for developed countries than for developing ones, which is quite surprising.
miss_by_country = dataframe.groupby("Country")[col].apply(lambda s: s.isna().mean()).sort_values(ascending=False)
top = miss_by_country.head(20)[::-1]
plt.figure(figsize=(8,6))
plt.barh(top.index.astype(str), top.values*100)
plt.title("Top 20 countries by Hepatitis B missingness (%)")
plt.xlabel("Missing (%)"); plt.ylabel("Country")
plt.show()
Some countries have 100% missing values, and they are developed countries such as Finland, Denmark, Slovenia, the UK, Ireland, Switzerland, Norway and Japan. These are not poor or developing countries, so I cannot explain why the values are missing, unless some kind of merge issue is involved. I could perhaps understand it for the United Kingdom of Great Britain and Northern Ireland, or for countries such as Guinea or the Central African Republic, but not for the rest. I am not sure whether I should simply exclude the countries with 100% missing values, since this is almost certainly an error of some kind. If I excluded them, I would say that the remaining missingness is due to poorer countries lacking data, but how would I then justify Sweden, the Netherlands and Ireland? Next I will check whether other columns are also 100% missing for these countries. Another possibility is that the values are missing not because no data were collected, but because these countries had no general HepB vaccination programme, since the disease is rare in Northern Europe and Japan.
import numpy as np
import pandas as pd
col = "Hepatitis B"
miss_rate = dataframe.groupby("Country")[col].apply(lambda s: s.isna().mean())
full_missing_countries = miss_rate[miss_rate == 1.0].index.tolist()
print("Countries with 100% missing HepB:", len(full_missing_countries))
print(full_missing_countries[:30])
Countries with 100% missing HepB: 9
['Denmark', 'Finland', 'Hungary', 'Iceland', 'Japan', 'Norway', 'Slovenia', 'Switzerland', 'United Kingdom of Great Britain and Northern Ireland']
weird = ["Denmark","Norway","Iceland","Finland","Switzerland","Japan"]
dataframe[dataframe["Country"].isin(weird)].isna().mean().sort_values(ascending=False).head(10)
Hepatitis B                        1.000000
Total expenditure                  0.062500
Alcohol                            0.052083
Country                            0.000000
Schooling                          0.000000
Income composition of resources    0.000000
thinness 5-9 years                 0.000000
thinness 10-19 years               0.000000
Population                         0.000000
GDP                                0.000000
dtype: float64
dataframe[dataframe["Country"].isin(weird)].groupby("Country")["Year"].nunique().sort_values()
Country
Denmark        16
Finland        16
Iceland        16
Japan          16
Norway         16
Switzerland    16
Name: Year, dtype: int64
plt.figure()
plt.hist(dataframe.loc[~m, col].dropna(), bins=20)
plt.title("Distribution of observed Hepatitis B")
plt.xlabel("Hepatitis B"); plt.ylabel("Count")
plt.show()
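The distribution is strongly left-skewed: most values lie close to the upper bound, with a long tail toward low coverage (mean ≈ 81 versus median 92). It is not normal, so we will use the median rather than the mean to impute the missing values.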
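The direction of the skew can also be checked numerically instead of by eye. The sketch below uses a small made-up coverage series (an assumption, not the dataset itself): a mean below the median together with a negative skewness coefficient indicates a long lower tail.

```python
import pandas as pd

# Toy coverage values (hypothetical), bunched near the upper bound
# with a few very low values forming a long lower tail.
coverage = pd.Series([99, 98, 97, 97, 96, 95, 94, 92, 90, 85, 70, 40, 9],
                     dtype=float)

skewness = coverage.skew()  # sample skewness; negative means left-skewed
print("mean  :", round(coverage.mean(), 2))
print("median:", coverage.median())
print("skew  :", round(skewness, 3))
```

Applied to the real column, the same three numbers would show whether the median or the mean is the more representative fill value.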
tab = pd.crosstab(dataframe["Status"], m)
print((tab.div(tab.sum(axis=1), axis=0)*100).round(2))
print(stats.chi2_contingency(tab)[:2])
Hepatitis B  False   True
Status
Developed    66.21  33.79
Developing   84.76  15.24
(95.14342955568858, 1.7707895774911682e-22)
Status is categorical and missingness is also categorical (binary: yes/no), so we use a chi-square test of independence to check whether missingness is associated with Status. H0 states that missingness is independent of Status; the alternative hypothesis states that it depends on Status. We reject H0 because χ² ≈ 95.14 with p ≈ 1.8e-22, so we conclude that missingness of Hepatitis B depends on Status (i.e. it is not MCAR). The difference is statistically significant: the missing rate is 33.79% for developed countries versus 15.24% for developing ones.
a = dataframe.loc[m, "Life expectancy"].dropna()
b = dataframe.loc[~m, "Life expectancy"].dropna()
sw_stat_m, sw_p_m = shapiro(a.sample(min(500, len(a)), random_state=42))
sw_stat_p, sw_p_p = shapiro(b.sample(min(500, len(b)), random_state=42))
print(sw_stat_m, sw_p_m, sw_stat_p, sw_p_p)
print(stats.mannwhitneyu(a, b, alternative="two-sided"))
0.9363057321020695 8.735019563944517e-14 0.9474782707797058 2.541632488309031e-12
MannwhitneyuResult(statistic=565124.0, pvalue=1.943676481510776e-05)
WHAT WE GOT (on this dataset, with a sample of up to 500 rows per group):
HepB missing (LE): W=0.936, p=8.74e-14 → NOT normal
HepB present (LE): W=0.947, p=2.54e-12 → NOT normal
CONCLUSION: Since both p-values are < 0.05, we do not assume normality and choose a non-parametric test: Mann–Whitney U (a comparison of ranks).
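WHY THIS TEST: If rows with missing HepB also differ in life expectancy, the missingness is related to health outcomes, so it is not MCAR. Because life expectancy is itself observed, this is evidence of at least MAR; MNAR can be neither confirmed nor ruled out from the observed data alone.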
Here we split the dataset into two groups, rows where Hepatitis B is missing and rows where it is observed, and test whether life expectancy differs between them. Mann–Whitney U is a non-parametric test for comparing two independent samples; we use it because the distributions are not normal, so comparing means would not be appropriate. As with any statistical test, the null hypothesis H0 states that life expectancy has the same distribution in groups a and b, while H1 states that there is a statistically significant difference between the two groups. Given the test result (U = 565124, p ≈ 1.9e-05), we reject H0 and conclude that life expectancy differs between the two groups. Missingness is therefore associated with life expectancy, and we again conclude that it is not completely random (not MCAR).
We compare rows where Hepatitis B is missing with rows where it is present. We use this test because the distribution of life expectancy is not normal in either group.
TEST 3: Spearman Correlation (HepB vs Polio + Diphtheria)
WHY: If HepB correlates strongly with other vaccine rates, we can use those as predictors when imputing HepB (important for imputation strategy).
WHY Spearman, not Pearson: HepB is left-skewed (mean ≈ 81 versus median 92). Spearman uses rank order, so it is robust to skew. Pearson measures linear association and is more sensitive to skew and outliers.
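The correlation check described here could be sketched as follows. The frame `toy` and its values are assumptions made for the example; on the real data the same loop would run over the `Hepatitis B`, `Polio` and `Diphtheria` columns.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Toy immunization coverage (hypothetical values); one HepB value is missing.
toy = pd.DataFrame({
    "Hepatitis B": [99.0, 97.0, 95.0, 90.0, 80.0, 60.0, 9.0, np.nan],
    "Polio":       [98.0, 99.0, 94.0, 91.0, 85.0, 55.0, 12.0, 70.0],
    "Diphtheria":  [98.0, 97.0, 96.0, 89.0, 82.0, 58.0, 10.0, 72.0],
})

results = {}
for other in ["Polio", "Diphtheria"]:
    # Use only rows where both columns are observed.
    pair = toy[["Hepatitis B", other]].dropna()
    rho, p_value = stats.spearmanr(pair["Hepatitis B"], pair[other])
    results[other] = rho
    print(f"HepB vs {other}: rho={rho:.3f}, p={p_value:.4f}, n={len(pair)}")
```

A rank correlation close to 1 would support using the other vaccine rates as a proxy when Hepatitis B is missing.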
pol = "Polio"
diph = "Diphtheria"
mask_full_missing_rows = dataframe["Country"].isin(full_missing_countries) & dataframe[col].isna()
proxy = dataframe[[pol, diph]].mean(axis=1, skipna=True)
dataframe.loc[mask_full_missing_rows, col] = proxy.loc[mask_full_missing_rows]
dataframe[col] = dataframe[col].fillna(dataframe.groupby("Country")[col].transform("median"))
dataframe[col] = dataframe[col].fillna(dataframe.groupby("Status")[col].transform("median"))
dataframe[col] = dataframe[col].fillna(dataframe[col].median())
Hepatitis B has a substantial share of missing values (about 18.5%). The visualizations show that missingness depends on year and country (a systematic pattern), so it is not MCAR. We therefore impute hierarchically. For countries with no Hepatitis B values at all, we first use the row-wise mean of Polio and Diphtheria as a proxy. For the rest we use the median per Country (which preserves the country's typical level across years), then a fallback median per Status, and finally the global median. In addition, we added an indicator Hepatitis B_missing so that the model can use the information that the value was originally missing.
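A minimal sketch of the indicator plus the median fallback chain is shown below. The toy frame and its values are assumptions for illustration. The indicator has to be created before any value is filled, otherwise the information about the original gaps is lost.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): country B has no Hepatitis B values at all.
toy = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "B", "C", "C"],
    "Status": ["Developing"] * 3 + ["Developed"] * 4,
    "Hepatitis B": [80.0, np.nan, 90.0, np.nan, np.nan, 95.0, 97.0],
})
col = "Hepatitis B"

# Record missingness BEFORE any imputation step.
toy[col + "_missing"] = toy[col].isna().astype(int)

# 1) per-country median, 2) per-Status median, 3) global median.
toy[col] = toy[col].fillna(toy.groupby("Country")[col].transform("median"))
toy[col] = toy[col].fillna(toy.groupby("Status")[col].transform("median"))
toy[col] = toy[col].fillna(toy[col].median())

print(toy)
```

Country A's gap is filled with its own median (85), while country B, which has no observed values, falls back to the median of the other developed rows (96).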
OTHER
cols = ["Total expenditure", "Alcohol", "Income composition of resources", "Schooling"]
ic = "Income composition of resources"
for c in cols:
    dataframe[c] = pd.to_numeric(dataframe[c], errors="coerce")
desc = dataframe[cols].describe(percentiles=[.01,.05,.25,.5,.75,.95,.99]).T
missing_pct = (dataframe[cols].isna().mean() * 100).round(3)
print(desc)
print("\nMissing %:")
print(missing_pct)
count mean std min 1% \
Total expenditure 2710.0 5.938594 2.498713 0.37 1.2309
Alcohol 2727.0 4.631492 4.048715 0.01 0.0100
Income composition of resources 2771.0 0.627551 0.210904 0.00 0.0000
Schooling 2775.0 11.992793 3.358920 0.00 2.0720
5% 25% 50% 75% 95% \
Total expenditure 1.930 4.260 5.755 7.4975 9.760
Alcohol 0.010 0.935 3.790 7.7450 11.974
Income composition of resources 0.277 0.493 0.677 0.7790 0.892
Schooling 5.800 10.100 12.300 14.3000 16.800
99% max
Total expenditure 12.9274 17.600
Alcohol 13.4848 17.870
Income composition of resources 0.9233 0.948
Schooling 19.0000 20.700
Missing %:
Total expenditure 7.192
Alcohol 6.610
Income composition of resources 5.103
Schooling 4.966
dtype: float64
for c in cols:
    s = dataframe[c].dropna()
    # histogram of the observed values
    plt.figure(figsize=(8,4))
    plt.hist(s, bins=40)
    plt.title(f"{c} distribution (hist)")
    plt.xlabel(c)
    plt.ylabel("Count")
    plt.show()
    # horizontal boxplot of the same values
    plt.figure(figsize=(8,2.5))
    plt.boxplot(s, vert=False)
    plt.title(f"{c} (boxplot)")
    plt.xlabel(c)
    plt.show()
c = "Total expenditure"
s = pd.to_numeric(dataframe[c], errors="coerce").dropna()
print("mean:", s.mean())
print("median:", s.median())
print("std:", s.std())
print("skew:", s.skew())
print("kurtosis:", s.kurtosis())
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lo, hi = q1 - 1.5*iqr, q3 + 1.5*iqr
print("IQR outliers count:", int(((s < lo) | (s > hi)).sum()))
print("IQR outliers %:", ((s < lo) | (s > hi)).mean()*100)
mean: 5.938594095940959
median: 5.755
std: 2.4987133746041814
skew: 0.6186269030868405
kurtosis: 1.156003910679439
IQR outliers count: 32
IQR outliers %: 1.1808118081180812
cols = ["Total expenditure", "Alcohol", "Income composition of resources", "Schooling"]
ic = "Income composition of resources"
for c in cols:
    dataframe[c] = pd.to_numeric(dataframe[c], errors="coerce")
for c in cols:
    s = dataframe[c].dropna()
    print("\n==", c, "==")
    print("missing %:", round(dataframe[c].isna().mean()*100, 3))
    print("mean :", round(s.mean(), 4))
    print("median :", round(s.median(), 4))
    print("skew :", round(s.skew(), 4))
== Total expenditure ==
missing %: 7.192
mean : 5.9386
median : 5.755
skew : 0.6186
== Alcohol ==
missing %: 6.61
mean : 4.6315
median : 3.79
skew : 0.5827
== Income composition of resources ==
missing %: 5.103
mean : 0.6276
median : 0.677
skew : -1.1438
== Schooling ==
missing %: 4.966
mean : 11.9928
median : 12.3
skew : -0.6024
counts = dataframe.groupby("Year")["Alcohol"].apply(lambda s: (s == 0.01).sum()).sort_index()
plt.figure(figsize=(8,3))
plt.plot(counts.index, counts.values, marker="o")
plt.title("Count of Alcohol == 0.01 by Year (possible placeholder)")
plt.xlabel("Year")
plt.ylabel("Count")
plt.show()
print("Top spike years for Alcohol==0.01:")
print(counts.sort_values(ascending=False).head(10))
Top spike years for Alcohol==0.01:
Year
2014    86
2013    62
2012    57
2001     9
2000     7
2002     7
2003     7
2011     7
2004     6
2005     5
Name: Alcohol, dtype: int64
counts = dataframe.groupby("Year")[ic].apply(lambda s: (s == 0.0).sum()).sort_index()
plt.figure(figsize=(8,3))
plt.plot(counts.index, counts.values, marker="o")
plt.title("Count of Income composition == 0.0 by Year (possible placeholder)")
plt.xlabel("Year")
plt.ylabel("Count")
plt.show()
print("Top spike years for IncomeComp==0:")
print(counts.sort_values(ascending=False).head(10))
Top spike years for IncomeComp==0:
Year
2000    31
2001    17
2002    17
2003    17
2004    15
2005    13
2006     4
2007     4
2008     4
2009     4
Name: Income composition of resources, dtype: int64
counts = dataframe.groupby("Year")["Schooling"].apply(lambda s: (s == 0.0).sum()).sort_index()
plt.figure(figsize=(8,3))
plt.plot(counts.index, counts.values, marker="o")
plt.title("Count of Schooling == 0.0 by Year (possible placeholder)")
plt.xlabel("Year")
plt.ylabel("Count")
plt.show()
print("Top spike years for Schooling==0:")
print(counts.sort_values(ascending=False).head(10))
Top spike years for Schooling==0:
Year
2000    8
2001    3
2002    3
2003    3
2004    2
2005    2
2013    2
2006    1
2007    1
2008    1
Name: Schooling, dtype: int64
df2 = dataframe.copy()
df2.loc[df2["Alcohol"] == 0.01, "Alcohol"] = np.nan
df2.loc[df2[ic] == 0.0, ic] = np.nan
df2.loc[df2["Schooling"] == 0.0, "Schooling"] = np.nan
print("Missing % AFTER placeholder->NaN:")
print((df2[cols].isna().mean()*100).round(3))
Missing % AFTER placeholder->NaN:
Total expenditure                   7.192
Alcohol                            15.890
Income composition of resources     9.555
Schooling                           5.925
dtype: float64
df2 = df2.sort_values(["Country", "Year"])
for c in cols:
    # 1) interpolate within each country, 2) country median, 3) global median
    df2[c] = df2.groupby("Country")[c].transform(
        lambda s: s.interpolate(limit_direction="both")
    )
    df2[c] = df2.groupby("Country")[c].transform(lambda s: s.fillna(s.median()))
    df2[c] = df2[c].fillna(df2[c].median())
print("Missing % AFTER imputation:")
print((df2[cols].isna().mean()*100).round(3))
RuntimeWarning: Mean of empty slice (numpy\lib\nanfunctions.py:1215; the same warning was emitted repeatedly)
Missing % AFTER imputation:
Total expenditure                  0.0
Alcohol                            0.0
Income composition of resources    0.0
Schooling                          0.0
dtype: float64
dataframe = df2
for c in cols:
    s = df2[c].dropna()
    plt.figure(figsize=(8,4))
    plt.hist(s, bins=40)
    plt.title(f"{c} distribution AFTER imputation")
    plt.xlabel(c)
    plt.ylabel("Count")
    plt.show()
def show_extremes(col, n=15):
    # print the n largest and n smallest values of a column with Country and Year
    s = pd.to_numeric(dataframe[col], errors="coerce")
    print(f"\n=== {col} TOP {n} ===")
    print(dataframe.loc[s.nlargest(n).index, ["Country","Year",col]].sort_values(col, ascending=False).to_string(index=False))
    print(f"\n=== {col} BOTTOM {n} ===")
    print(dataframe.loc[s.nsmallest(n).index, ["Country","Year",col]].sort_values(col).to_string(index=False))
show_extremes("Alcohol")
show_extremes("Schooling")
show_extremes("Total expenditure")
show_extremes("Income composition of resources")
=== Alcohol TOP 15 ===
Country Year Alcohol
Estonia 2007 17.87
Belarus 2011 17.31
Estonia 2008 16.99
Estonia 2006 16.58
Belarus 2012 16.35
Estonia 2005 15.52
Lithuania 2014 15.19
Lithuania 2015 15.19
Lithuania 2012 15.14
Estonia 2004 15.07
Estonia 2009 15.04
Lithuania 2013 15.04
Estonia 2010 14.97
Estonia 2011 14.97
Estonia 2012 14.97
=== Alcohol BOTTOM 15 ===
Country Year Alcohol
Afghanistan 2000 0.02
Afghanistan 2001 0.02
Afghanistan 2002 0.02
Afghanistan 2003 0.02
Afghanistan 2004 0.02
Afghanistan 2005 0.02
Afghanistan 2007 0.02
Iran (Islamic Republic of) 2000 0.02
Iran (Islamic Republic of) 2001 0.02
Iran (Islamic Republic of) 2002 0.02
Iran (Islamic Republic of) 2003 0.02
Iran (Islamic Republic of) 2004 0.02
Iran (Islamic Republic of) 2005 0.02
Iran (Islamic Republic of) 2006 0.02
Iran (Islamic Republic of) 2007 0.02
=== Schooling TOP 15 ===
Country Year Schooling
Australia 2004 20.7
Australia 2003 20.6
Australia 2001 20.5
Australia 2000 20.4
Australia 2014 20.4
Australia 2015 20.4
Australia 2005 20.3
Australia 2006 20.3
Australia 2013 20.3
New Zealand 2010 20.3
Australia 2002 20.1
Australia 2012 20.1
Australia 2011 19.8
New Zealand 2011 19.7
Australia 2010 19.5
=== Schooling BOTTOM 15 ===
Country Year Schooling
Niger 2000 2.8
Djibouti 2000 2.9
Djibouti 2001 2.9
Niger 2001 2.9
Niger 2002 2.9
Niger 2003 3.0
Niger 2004 3.1
Djibouti 2002 3.3
Burkina Faso 2000 3.4
Burkina Faso 2001 3.5
Djibouti 2003 3.5
Niger 2005 3.5
Burkina Faso 2002 3.6
Djibouti 2004 3.7
Niger 2006 3.7
=== Total expenditure TOP 15 ===
Country Year Total expenditure
United States of America 2011 17.60
Marshall Islands 2013 17.24
United States of America 2010 17.20
United States of America 2012 17.20
United States of America 2014 17.14
United States of America 2015 17.14
United States of America 2009 17.00
United States of America 2013 16.90
Tuvalu 2013 16.61
United States of America 2008 16.20
United States of America 2003 15.60
United States of America 2007 15.57
United States of America 2006 15.27
United States of America 2005 15.15
United States of America 2004 15.14
=== Total expenditure BOTTOM 15 ===
Country Year Total expenditure
Timor-Leste 2007 0.37
Timor-Leste 2006 0.65
Timor-Leste 2008 0.74
Timor-Leste 2011 0.76
Timor-Leste 2010 0.92
Germany 2000 1.10
Timor-Leste 2012 1.10
Austria 2001 1.12
Serbia 2013 1.12
Sierra Leone 2007 1.12
Germany 2001 1.15
Kiribati 2013 1.15
Belgium 2010 1.17
Japan 2012 1.17
Denmark 2008 1.18
=== Income composition of resources TOP 15 ===
Country Year Income composition of resources
Norway 2015 0.948
Norway 2014 0.945
Norway 2013 0.942
Norway 2012 0.941
Norway 2011 0.939
Switzerland 2015 0.938
Australia 2015 0.937
Australia 2014 0.936
Norway 2008 0.936
Norway 2009 0.936
Norway 2010 0.936
Switzerland 2014 0.936
Norway 2007 0.934
Switzerland 2013 0.934
Australia 2013 0.933
=== Income composition of resources BOTTOM 15 ===
Country Year Income composition of resources
Niger 2000 0.253
Niger 2001 0.255
Niger 2002 0.261
Niger 2003 0.266
Burundi 2000 0.268
Burundi 2001 0.268
Burundi 2002 0.268
Niger 2004 0.270
Burundi 2003 0.276
Niger 2005 0.278
Burundi 2004 0.279
Ethiopia 2000 0.283
Ethiopia 2001 0.283
Chad 2003 0.284
Burundi 2005 0.286
# save the cleaned frame; `df` is the earlier sorted copy made before the GDP cleaning
dataframe.to_csv('output.csv')
cols = ["BMI","thinness 10-19 years","thinness 5-9 years",
"Diphtheria","Polio","Adult Mortality","Life expectancy"]
for c in cols:
    dataframe[c] = pd.to_numeric(dataframe[c], errors="coerce")
dataframe[["Country","Year","BMI"]].sort_values("BMI", ascending=False).head(15)
| | Country | Year | BMI |
|---|---|---|---|
| 1812 | Nauru | 2013 | 87.3 |
| 1958 | Palau | 2013 | 83.3 |
| 1650 | Marshall Islands | 2013 | 81.6 |
| 2713 | Tuvalu | 2013 | 79.3 |
| 1378 | Kiribati | 2015 | 77.6 |
| 1379 | Kiribati | 2014 | 77.1 |
| 1380 | Kiribati | 2013 | 76.7 |
| 1381 | Kiribati | 2012 | 76.2 |
| 1382 | Kiribati | 2011 | 75.7 |
| 2633 | Tonga | 2015 | 75.2 |
| 1383 | Kiribati | 2010 | 75.2 |
| 2634 | Tonga | 2014 | 74.8 |
| 2200 | Samoa | 2015 | 74.7 |
| 1384 | Kiribati | 2009 | 74.6 |
| 2201 | Samoa | 2014 | 74.3 |
checks = {
"BMI==0": (dataframe["BMI"]==0).sum(),
"BMI==1": (dataframe["BMI"]==1).sum(),
"Thin10==0": (dataframe["thinness 10-19 years"]==0).sum(),
"Thin5==0": (dataframe["thinness 5-9 years"]==0).sum(),
"Diphtheria==0": (dataframe["Diphtheria"]==0).sum(),
"Polio==0": (dataframe["Polio"]==0).sum()
}
print(checks)
{'BMI==0': 0, 'BMI==1': 1, 'Thin10==0': 0, 'Thin5==0': 0, 'Diphtheria==0': 0, 'Polio==0': 0}
print(dataframe.groupby("Year")["BMI"].apply(lambda s: (s==1).sum()).sort_values(ascending=False).head(10))
print(dataframe.groupby("Year")["Diphtheria"].apply(lambda s: (s==0).sum()).sort_values(ascending=False).head(10))
print(dataframe.groupby("Year")["Polio"].apply(lambda s: (s==0).sum()).sort_values(ascending=False).head(10))
Year
2002    1
2000    0
2001    0
2003    0
2004    0
2005    0
2006    0
2007    0
2008    0
2009    0
Name: BMI, dtype: int64
Year
2000    0
2001    0
2002    0
2003    0
2004    0
2005    0
2006    0
2007    0
2008    0
2009    0
Name: Diphtheria, dtype: int64
Year
2000    0
2001    0
2002    0
2003    0
2004    0
2005    0
2006    0
2007    0
2008    0
2009    0
Name: Polio, dtype: int64
for c in ["BMI","thinness 10-19 years","thinness 5-9 years","Diphtheria","Polio"]:
    tmp = dataframe[["Country","Year",c]].dropna().sort_values(["Country","Year"])
    # absolute year-over-year change within each country
    tmp["absdiff"] = tmp.groupby("Country")[c].diff().abs()
    thresh = tmp["absdiff"].quantile(0.99)
    big = tmp[tmp["absdiff"] > thresh].sort_values("absdiff", ascending=False).head(10)
    print("\n==", c, "== 99th% absdiff threshold:", round(thresh,4))
    print(big[["Country","Year",c,"absdiff"]].to_string(index=False))
== BMI == 99th% absdiff threshold: 54.6
Country Year BMI absdiff
Kiribati 2004 71.4 63.8
Tonga 2008 71.5 63.7
Kuwait 2015 71.4 63.6
Samoa 2008 71.4 63.5
Samoa 2006 7.3 62.4
Tonga 2006 7.1 62.3
Kuwait 2013 7.2 62.3
Kiribati 2003 7.6 62.1
United Arab Emirates 2014 62.4 55.9
Tunisia 2015 61.2 55.0
== thinness 10-19 years == 99th% absdiff threshold: 8.8
Country Year thinness 10-19 years absdiff
Pakistan 2007 2.8 18.2
Pakistan 2012 19.8 17.8
Afghanistan 2002 19.9 17.8
Bangladesh 2005 19.9 17.8
South Africa 2006 1.6 10.0
Namibia 2009 1.9 9.6
Botswana 2003 1.9 9.5
Lesotho 2002 1.6 9.5
Zimbabwe 2001 1.6 9.4
Niger 2010 1.7 9.3
== thinness 5-9 years == 99th% absdiff threshold: 8.8
Country Year thinness 5-9 years absdiff
Bangladesh 2003 2.9 18.2
Pakistan 2009 2.9 18.2
Pakistan 2014 19.8 17.8
Bangladesh 2008 19.9 17.8
Afghanistan 2003 19.9 17.7
South Africa 2008 1.7 10.0
Namibia 2009 1.9 9.5
Lesotho 2002 1.6 9.5
Zimbabwe 2001 1.7 9.5
Botswana 2003 1.8 9.5
== Diphtheria == 99th% absdiff threshold: 83.0
Country Year Diphtheria absdiff
Belarus 2003 5.0 94.0
Belarus 2004 99.0 94.0
Saint Lucia 2001 99.0 92.0
Cabo Verde 2011 9.0 90.0
Solomon Islands 2011 99.0 90.0
Solomon Islands 2007 9.0 90.0
Ghana 2014 98.0 89.0
Ukraine 2008 9.0 89.0
Swaziland 2015 9.0 89.0
Peru 2001 9.0 89.0
== Polio == 99th% absdiff threshold: 84.0
Country Year Polio absdiff
Saint Lucia 2001 99.0 92.0
Comoros 2002 98.0 91.0
Solomon Islands 2006 99.0 90.0
Cabo Verde 2011 9.0 90.0
Saint Lucia 2002 9.0 90.0
Comoros 2003 8.0 90.0
Belarus 2008 98.0 89.0
Belarus 2007 9.0 88.0
Kenya 2011 97.0 88.0
Ecuador 2004 9.0 88.0
spike_countries = ["Kiribati", "Tonga", "Kuwait", "Samoa", "United Arab Emirates", "Tunisia"]
tmp = dataframe.loc[dataframe["Country"].isin(spike_countries), ["Country","Year","BMI"]].copy()
tmp["BMI"] = pd.to_numeric(tmp["BMI"], errors="coerce")
for country in spike_countries:
    s = tmp[tmp["Country"] == country].sort_values("Year")
    plt.figure(figsize=(8,3))
    plt.plot(s["Year"], s["BMI"], marker="o")
    plt.title(f"BMI over time: {country}")
    plt.xlabel("Year")
    plt.ylabel("BMI")
    plt.grid(True, alpha=0.3)
    plt.show()
bmi = pd.to_numeric(dataframe["BMI"], errors="coerce")
print("BMI < 10:", int((bmi < 10).sum()))
print("BMI > 60:", int((bmi > 60).sum()))
print("Top high BMI rows:")
print(dataframe.loc[bmi > 60, ["Country","Year","BMI"]].sort_values("BMI", ascending=False).head(30).to_string(index=False))
print("\nTop low BMI rows:")
print(dataframe.loc[bmi < 10, ["Country","Year","BMI"]].sort_values("BMI").head(30).to_string(index=False))
plt.figure(figsize=(8,4))
plt.hist(bmi.dropna(), bins=60)
plt.title("BMI distribution (watch for weird mass <10 or >60)")
plt.xlabel("BMI")
plt.ylabel("Count")
plt.show()
BMI < 10: 281
BMI > 60: 351
Top high BMI rows:
Country Year BMI
Nauru 2013 87.3
Palau 2013 83.3
Marshall Islands 2013 81.6
Tuvalu 2013 79.3
Kiribati 2015 77.6
Kiribati 2014 77.1
Kiribati 2013 76.7
Kiribati 2012 76.2
Kiribati 2011 75.7
Tonga 2015 75.2
Kiribati 2010 75.2
Tonga 2014 74.8
Samoa 2015 74.7
Kiribati 2009 74.6
Tonga 2013 74.3
Samoa 2014 74.3
Kiribati 2008 74.1
Samoa 2013 73.8
Tonga 2012 73.8
Kiribati 2007 73.4
Samoa 2012 73.4
Tonga 2011 73.3
Samoa 2011 72.9
Kiribati 2006 72.8
Tonga 2010 72.7
Samoa 2010 72.5
Tonga 2009 72.1
Kiribati 2005 72.1
Samoa 2009 72.0
Tonga 2008 71.5
Top low BMI rows:
Country Year BMI
Viet Nam 2002 1.0
Viet Nam 2003 1.4
Bangladesh 2000 1.4
Bangladesh 2001 1.8
Viet Nam 2004 1.9
Madagascar 2014 2.0
Rwanda 2013 2.1
Philippines 2005 2.1
Comoros 2007 2.1
Mozambique 2009 2.1
Democratic Republic of the Congo 2012 2.1
Benin 2004 2.1
Pakistan 2007 2.1
Lao People's Democratic Republic 2013 2.1
Kenya 2012 2.1
Guinea-Bissau 2005 2.1
Ghana 2001 2.1
Thailand 2002 2.2
Liberia 2000 2.2
Mali 2009 2.2
United Republic of Tanzania 2009 2.2
Sierra Leone 2007 2.2
Zambia 2009 2.2
Central African Republic 2010 2.2
Equatorial Guinea 2005 2.2
Maldives 2008 2.3
Gambia 2004 2.3
Congo 2002 2.3
Bhutan 2010 2.3
Guinea 2009 2.3
df2 = dataframe.copy()
df2["BMI"] = pd.to_numeric(df2["BMI"], errors="coerce")
tmp = df2[["Country","Year","BMI"]].dropna().sort_values(["Country","Year"])
tmp["absdiff"] = tmp.groupby("Country")["BMI"].diff().abs()
# show the worst jumps
worst = tmp.sort_values("absdiff", ascending=False).head(50)
print(worst[["Country","Year","BMI","absdiff"]].to_string(index=False))
# list countries that have a huge jump (tune threshold if you want)
spike_countries = tmp.loc[tmp["absdiff"] > 30, "Country"].unique()
print("\nCountries with BMI jump > 30:", len(spike_countries))
print(list(spike_countries)[:50])
Country Year BMI absdiff
Kiribati 2004 71.4 63.8
Tonga 2008 71.5 63.7
Kuwait 2015 71.4 63.6
Samoa 2008 71.4 63.5
Samoa 2006 7.3 62.4
Tonga 2006 7.1 62.3
Kuwait 2013 7.2 62.3
Kiribati 2003 7.6 62.1
United Arab Emirates 2014 62.4 55.9
Tunisia 2015 61.2 55.0
Fiji 2013 61.1 54.9
Libya 2012 61.8 54.9
Turkey 2009 61.1 54.9
Egypt 2015 61.1 54.9
Jordan 2010 61.7 54.8
United States of America 2002 61.7 54.8
Ireland 2013 61.3 54.8
Saudi Arabia 2007 61.6 54.7
Mexico 2012 61.5 54.7
Poland 2014 61.1 54.7
Portugal 2015 61.6 54.7
Greece 2006 61.2 54.7
Bahrain 2012 61.5 54.7
Croatia 2011 61.3 54.7
Canada 2005 61.3 54.7
Cuba 2015 61.4 54.7
Belarus 2013 61.1 54.6
Spain 2006 61.1 54.6
Chile 2011 61.2 54.6
New Zealand 2004 61.5 54.6
Lebanon 2006 61.4 54.6
Venezuela (Bolivarian Republic of) 2013 61.0 54.6
Argentina 2012 61.0 54.6
Bulgaria 2008 61.5 54.6
Hungary 2009 61.1 54.6
Australia 2005 61.5 54.6
United Kingdom of Great Britain and Northern Ireland 2006 61.3 54.6
Ukraine 2015 61.3 54.6
Montenegro 2014 61.3 54.6
Bahamas 2010 61.3 54.6
Germany 2013 61.4 54.5
Netherlands 2013 61.0 54.5
France 2012 61.1 54.5
Israel 2006 61.1 54.5
Italy 2010 61.0 54.5
Uruguay 2010 61.2 54.5
Lithuania 2013 61.4 54.5
Latvia 2015 61.2 54.5
Czechia 2005 61.3 54.5
Norway 2015 61.2 54.4
Countries with BMI jump > 30: 108
['Albania', 'Algeria', 'Antigua and Barbuda', 'Argentina', 'Armenia', 'Australia', 'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Barbados', 'Belarus', 'Belgium', 'Belize', 'Bolivia (Plurinational State of)', 'Bosnia and Herzegovina', 'Brazil', 'Brunei Darussalam', 'Bulgaria', 'Canada', 'Chile', 'Colombia', 'Costa Rica', 'Croatia', 'Cuba', 'Cyprus', 'Czechia', 'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador', 'Fiji', 'Finland', 'France', 'Georgia', 'Germany', 'Greece', 'Grenada', 'Guatemala', 'Guyana', 'Haiti', 'Honduras', 'Hungary', 'Iceland', 'Iran (Islamic Republic of)', 'Iraq', 'Ireland', 'Israel', 'Italy', 'Jamaica']
dataframe = dataframe.drop(columns=["BMI"], errors="ignore")
dataframe = dataframe.drop(columns=["thinness 5-9 years"], errors="ignore")
for c in ["Polio", "Diphtheria"]:
    dataframe[c] = pd.to_numeric(dataframe[c], errors="coerce")
    dataframe[c] = dataframe[c].clip(0, 100)
cols = ["Polio", "Diphtheria"]
for c in cols:
    dataframe[c] = pd.to_numeric(dataframe[c], errors="coerce")
print("Missing %:")
print((dataframe[cols].isna().mean()*100).round(3))
for c in cols:
    plt.figure(figsize=(8,4))
    plt.hist(dataframe[c].dropna(), bins=40)
    plt.title(f"{c} distribution (before imputation)")
    plt.xlabel(c)
    plt.ylabel("Count")
    plt.show()
Missing %:
Polio         0.651
Diphtheria    0.651
dtype: float64
for c in ["Polio","Diphtheria"]:
    s = pd.to_numeric(dataframe[c], errors="coerce")
    print("\n==", c, "==")
    print("min:", float(s.min()), "max:", float(s.max()))
    print("<10 count:", int((s < 10).sum()))
    print("<10 %:", round((s < 10).mean()*100, 3))
    print("==0 count:", int((s == 0).sum()))
== Polio ==
min: 3.0 max: 99.0
<10 count: 167
<10 %: 5.719
==0 count: 0

== Diphtheria ==
min: 2.0 max: 99.0
<10 count: 166
<10 %: 5.685
==0 count: 0
low = dataframe.loc[pd.to_numeric(dataframe["Polio"], errors="coerce") < 10,
["Country","Year","Polio","Diphtheria","Status"]].copy()
low["Polio"] = pd.to_numeric(low["Polio"], errors="coerce")
print("Low Polio rows:", len(low))
print(low.sort_values(["Polio","Country","Year"]).head(40).to_string(index=False))
Low Polio rows: 167
Country Year Polio Diphtheria Status
Angola 2000 3.0 28.0 Developing
Chad 2000 3.0 36.0 Developing
Chad 2008 3.0 19.0 Developing
Democratic Republic of the Congo 2001 3.0 3.0 Developing
Equatorial Guinea 2012 3.0 24.0 Developing
Equatorial Guinea 2013 3.0 3.0 Developing
Angola 2003 4.0 4.0 Developing
Angola 2004 4.0 4.0 Developing
Central African Republic 2001 4.0 4.0 Developing
Chad 2011 4.0 33.0 Developing
Democratic Republic of the Congo 2002 4.0 38.0 Developing
Niger 2011 4.0 75.0 Developing
Nigeria 2002 4.0 25.0 Developing
Afghanistan 2004 5.0 5.0 Developing
Congo 2003 5.0 5.0 Developing
Equatorial Guinea 2005 5.0 39.0 Developing
Haiti 2000 5.0 41.0 Developing
Lao People's Democratic Republic 2005 5.0 49.0 Developing
South Sudan 2013 5.0 45.0 Developing
Syrian Arab Republic 2013 5.0 41.0 Developing
Syrian Arab Republic 2015 5.0 41.0 Developing
Afghanistan 2015 6.0 65.0 Developing
Democratic Republic of the Congo 2005 6.0 6.0 Developing
Guinea 2009 6.0 57.0 Developing
Haiti 2005 6.0 6.0 Developing
Lao People's Democratic Republic 2008 6.0 61.0 Developing
Madagascar 2001 6.0 6.0 Developing
Nigeria 2008 6.0 53.0 Developing
Samoa 2011 6.0 65.0 Developing
Senegal 2002 6.0 6.0 Developing
Sudan 2002 6.0 6.0 Developing
Syrian Arab Republic 2011 6.0 72.0 Developing
Angola 2015 7.0 64.0 Developing
Comoros 2000 7.0 7.0 Developing
Comoros 2001 7.0 7.0 Developing
Côte d'Ivoire 2001 7.0 66.0 Developing
Côte d'Ivoire 2002 7.0 64.0 Developing
Ethiopia 2011 7.0 65.0 Developing
Ethiopia 2012 7.0 69.0 Developing
Ethiopia 2013 7.0 72.0 Developing
low = dataframe.loc[pd.to_numeric(dataframe["Diphtheria"], errors="coerce") < 10,
["Country","Year","Polio","Diphtheria","Status"]].copy()
low["Diphtheria"] = pd.to_numeric(low["Diphtheria"], errors="coerce")
print("Low Diphtheria rows:", len(low))
print(low.sort_values(["Diphtheria","Country","Year"]).head(40).to_string(index=False))
Low Diphtheria rows: 166
Country Year Polio Diphtheria Status
Equatorial Guinea 2014 24.0 2.0 Developing
Democratic Republic of the Congo 2001 3.0 3.0 Developing
Equatorial Guinea 2013 3.0 3.0 Developing
Ethiopia 2000 55.0 3.0 Developing
Angola 2003 4.0 4.0 Developing
Angola 2004 4.0 4.0 Developing
Central African Republic 2001 4.0 4.0 Developing
Chad 2006 49.0 4.0 Developing
Chad 2012 51.0 4.0 Developing
Democratic Republic of the Congo 2000 42.0 4.0 Developing
Equatorial Guinea 2006 52.0 4.0 Developing
Ethiopia 2004 54.0 4.0 Developing
Nigeria 2006 46.0 4.0 Developing
Afghanistan 2004 5.0 5.0 Developing
Belarus 2003 53.0 5.0 Developing
Congo 2003 5.0 5.0 Developing
Ethiopia 2007 61.0 5.0 Developing
Guinea 2001 52.0 5.0 Developing
Lao People's Democratic Republic 2007 46.0 5.0 Developing
Liberia 2014 49.0 5.0 Developing
Togo 2001 51.0 5.0 Developing
Ukraine 2011 54.0 5.0 Developing
Venezuela (Bolivarian Republic of) 2008 76.0 5.0 Developing
Angola 2009 63.0 6.0 Developing
Cambodia 2001 59.0 6.0 Developing
Democratic Republic of the Congo 2005 6.0 6.0 Developing
Democratic Republic of the Congo 2010 76.0 6.0 Developing
Guinea 2004 65.0 6.0 Developing
Guinea 2008 59.0 6.0 Developing
Guinea-Bissau 2003 65.0 6.0 Developing
Haiti 2005 6.0 6.0 Developing
Haiti 2006 61.0 6.0 Developing
Haiti 2015 56.0 6.0 Developing
Liberia 2005 66.0 6.0 Developing
Liberia 2006 66.0 6.0 Developing
Madagascar 2001 6.0 6.0 Developing
Philippines 2015 79.0 6.0 Developing
Senegal 2002 6.0 6.0 Developing
Sudan 2002 6.0 6.0 Developing
Benin 2005 73.0 7.0 Developing
low = dataframe.loc[pd.to_numeric(dataframe["Polio"], errors="coerce") < 10,
["Country","Year","Polio","Diphtheria","Status"]].copy()
low["Polio"] = pd.to_numeric(low["Polio"], errors="coerce")
print("Max Polio in low set:", low["Polio"].max())
print(low.sort_values("Polio").head(20).to_string(index=False))
Max Polio in low set: 9.0
Country Year Polio Diphtheria Status
Democratic Republic of the Congo 2001 3.0 3.0 Developing
Angola 2000 3.0 28.0 Developing
Chad 2000 3.0 36.0 Developing
Equatorial Guinea 2012 3.0 24.0 Developing
Equatorial Guinea 2013 3.0 3.0 Developing
Chad 2008 3.0 19.0 Developing
Democratic Republic of the Congo 2002 4.0 38.0 Developing
Central African Republic 2001 4.0 4.0 Developing
Chad 2011 4.0 33.0 Developing
Nigeria 2002 4.0 25.0 Developing
Niger 2011 4.0 75.0 Developing
Angola 2004 4.0 4.0 Developing
Angola 2003 4.0 4.0 Developing
Congo 2003 5.0 5.0 Developing
Equatorial Guinea 2005 5.0 39.0 Developing
Lao People's Democratic Republic 2005 5.0 49.0 Developing
Haiti 2000 5.0 41.0 Developing
Afghanistan 2004 5.0 5.0 Developing
Syrian Arab Republic 2013 5.0 41.0 Developing
South Sudan 2013 5.0 45.0 Developing
df2 = dataframe.sort_values(["Country", "Year"]).copy()
for c in ["Polio", "Diphtheria"]:
    df2[c] = df2.groupby("Country")[c].transform(lambda s: s.interpolate(limit_direction="both"))
    df2[c] = df2.groupby("Country")[c].transform(lambda s: s.fillna(s.median()))
    df2[c] = df2[c].fillna(df2[c].median())
print("Missing % after Polio/Diphtheria imputation:")
print((df2[["Polio","Diphtheria"]].isna().mean()*100).round(4))
dataframe = df2
Missing % after Polio/Diphtheria imputation:
Polio         0.0
Diphtheria    0.0
dtype: float64
tmp = dataframe[["Life expectancy","Polio","Diphtheria"]].copy()
for c in tmp.columns:
    tmp[c] = pd.to_numeric(tmp[c], errors="coerce")
print(tmp.corr(numeric_only=True)["Life expectancy"].sort_values(ascending=False))
Life expectancy    1.000000
Diphtheria         0.464856
Polio              0.449946
Name: Life expectancy, dtype: float64
col = "thinness 10-19 years"
dataframe[col] = pd.to_numeric(dataframe[col], errors="coerce")
print("Missing %:", dataframe[col].isna().mean()*100)
print(dataframe[col].describe())
plt.figure(figsize=(8,4))
plt.hist(dataframe[col].dropna(), bins=40)
plt.title("thinness 10-19 years distribution")
plt.xlabel(col)
plt.ylabel("Count")
plt.show()
Missing %: 1.1643835616438356
count    2886.000000
mean        4.829522
std         4.428383
min         0.100000
25%         1.600000
50%         3.300000
75%         7.175000
max        27.700000
Name: thinness 10-19 years, dtype: float64
df_sorted = dataframe.sort_values(["Country","Year"]).copy()
df_sorted[col] = pd.to_numeric(df_sorted[col], errors="coerce")
tmp = df_sorted[["Country","Year",col]].dropna().copy()
tmp["absdiff"] = tmp.groupby("Country")[col].diff().abs()
thr = tmp["absdiff"].quantile(0.99)
spikes = tmp[tmp["absdiff"] > thr].sort_values("absdiff", ascending=False)
print("99th% jump threshold:", thr)
print("Top spikes:")
print(spikes.head(25).to_string(index=False))
99th% jump threshold: 8.8
Top spikes:
Country Year thinness 10-19 years absdiff
Pakistan 2007 2.8 18.2
Pakistan 2012 19.8 17.8
Afghanistan 2002 19.9 17.8
Bangladesh 2005 19.9 17.8
South Africa 2006 1.6 10.0
Namibia 2009 1.9 9.6
Botswana 2003 1.9 9.5
Lesotho 2002 1.6 9.5
Zimbabwe 2001 1.6 9.4
Niger 2010 1.7 9.3
Nigeria 2012 1.7 9.3
Burkina Faso 2003 1.7 9.3
Democratic Republic of the Congo 2008 1.8 9.3
Mali 2001 1.8 9.2
Chad 2003 1.9 9.2
Senegal 2008 1.8 9.2
Timor-Leste 2014 1.9 9.2
Ethiopia 2011 1.9 9.1
Indonesia 2003 1.9 9.1
Cambodia 2014 1.9 9.1
Eritrea 2002 9.9 8.9
Democratic Republic of the Congo 2013 9.9 8.9
Central African Republic 2004 9.9 8.9
Senegal 2013 9.9 8.9
Lao People's Democratic Republic 2006 9.9 8.9
for country in spikes["Country"].head(5).unique():
    s = df_sorted[df_sorted["Country"]==country][["Year",col]].sort_values("Year")
    plt.figure(figsize=(9,3))
    plt.plot(s["Year"], s[col], marker="o")
    plt.title(f"{col} over time (spike check): {country}")
    plt.xlabel("Year"); plt.ylabel(col)
    plt.grid(True, alpha=0.3)
    plt.show()
“Thinness is measured per country-year; values are generally smooth but can have regime shifts likely due to data source/measurement changes.”
“We did not modify observed values; we only imputed missing values using country-wise interpolation, preserving each country’s trajectory.”
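The country-wise imputation described in the note above can be illustrated on a toy panel. This is a minimal sketch, not the notebook's actual data: the `toy` frame, its `value` column and the country labels "A" and "B" are hypothetical, and only the first and last imputation steps are shown.

```python
import numpy as np
import pandas as pd

# Toy panel (hypothetical values): country "A" has gaps, country "B" has no observations.
toy = pd.DataFrame({
    "Country": ["A"] * 5 + ["B"] * 3,
    "Year": [2000, 2001, 2002, 2003, 2004, 2000, 2001, 2002],
    "value": [np.nan, 2.0, np.nan, 4.0, np.nan, np.nan, np.nan, np.nan],
})
toy = toy.sort_values(["Country", "Year"]).copy()

# Step 1: linear interpolation inside each country. limit_direction="both"
# also fills leading and trailing gaps using the nearest observed value.
toy["filled"] = toy.groupby("Country")["value"].transform(
    lambda s: s.interpolate(limit_direction="both")
)

# Step 2: a country with no observed values stays NaN after step 1,
# so fall back to the global median of the observed values.
toy["filled"] = toy["filled"].fillna(toy["value"].median())

print(toy)
```

Observed values are left untouched; only the gaps are filled, and a country's own trajectory takes priority over the global fallback.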
df2 = dataframe.sort_values(["Country","Year"]).copy()
df2[col] = pd.to_numeric(df2[col], errors="coerce")
df2.loc[(df2[col] < 0) | (df2[col] > 50), col] = np.nan
df2[col] = df2.groupby("Country")[col].transform(lambda s: s.interpolate(limit_direction="both"))
df2[col] = df2.groupby("Country")[col].transform(lambda s: s.fillna(s.median()))
df2[col] = df2[col].fillna(df2[col].median())
print("Missing % after thinness impute:", df2[col].isna().mean()*100)
dataframe = df2
Missing % after thinness impute: 0.0
c:\Users\lukam\AppData\Local\Programs\Python\Python312\Lib\site-packages\numpy\lib\nanfunctions.py:1215: RuntimeWarning: Mean of empty slice
  return np.nanmean(a, axis, out=out, keepdims=keepdims)
dataframe = dataframe.drop(columns=["Adult Mortality"], errors="ignore")
y = "Life expectancy"
dataframe[y] = pd.to_numeric(dataframe[y], errors="coerce")
missing_le = dataframe[dataframe[y].isna()]
print("Missing Life expectancy rows:", len(missing_le))
Missing Life expectancy rows: 8
countries = missing_le["Country"].value_counts()
print("Countries with missing Life expectancy:", len(countries))
print(countries.head(30)) # top 30 countries by missing count
print("\nAll affected countries:")
print(countries.index.tolist())
Countries with missing Life expectancy: 8
Country
Dominica                 1
Marshall Islands         1
Monaco                   1
Nauru                    1
Palau                    1
Saint Kitts and Nevis    1
San Marino               1
Tuvalu                   1
Name: count, dtype: int64

All affected countries:
['Dominica', 'Marshall Islands', 'Monaco', 'Nauru', 'Palau', 'Saint Kitts and Nevis', 'San Marino', 'Tuvalu']
dataframe = dataframe[dataframe["Life expectancy"].notna()].copy()
dataframe = dataframe.drop("Country_wb",axis = 1)
Feature Engineering¶
Feature engineering is applied to the variables of the dataset under study. The idea is to combine, transform, or restructure existing variables in order to extract information that is not explicitly present in the original data. Creating new variables makes it easier for the model to recognize patterns and relationships in the data, which can improve its predictive power.
Before we start reworking the variables, let us look at which variables remain in the dataset. (We previously dropped "BMI" because of major inconsistencies, and "thinness 5-9 years" because its correlations and distribution are practically identical to those of the other thinness variable.)
dataframe.head()
| | Country | Year | Life expectancy | infant deaths | Alcohol | percentage expenditure | Hepatitis B | Measles | under-five deaths | Polio | ... | Diphtheria | HIV/AIDS | GDP | Population | thinness 10-19 years | Income composition of resources | Schooling | Status_Developing | immunization_index | log_thinness 10-19 years |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 15 | Afghanistan | 2000 | 54.8 | 88 | 0.02 | 10.424960 | 62.0 | 6532 | 122 | 24.0 | ... | 24.0 | 0.1 | 114.560000 | 29375600.0 | 2.3 | 0.338 | 5.5 | 1 | 36.666667 | 1.193922 |
| 14 | Afghanistan | 2001 | 55.3 | 88 | 0.02 | 10.574728 | 63.0 | 8762 | 122 | 35.0 | ... | 33.0 | 0.1 | 117.496980 | 29664630.0 | 2.1 | 0.340 | 5.9 | 1 | 43.666667 | 1.131402 |
| 13 | Afghanistan | 2002 | 56.2 | 88 | 0.02 | 16.887351 | 64.0 | 2486 | 122 | 36.0 | ... | 36.0 | 0.1 | 187.845950 | 21979923.0 | 19.9 | 0.341 | 6.2 | 1 | 45.333333 | 3.039749 |
| 12 | Afghanistan | 2003 | 56.7 | 87 | 0.02 | 11.089053 | 65.0 | 798 | 122 | 41.0 | ... | 41.0 | 0.1 | 198.728544 | 23648510.0 | 19.7 | 0.373 | 6.5 | 1 | 49.000000 | 3.030134 |
| 11 | Afghanistan | 2004 | 57.0 | 87 | 0.02 | 15.296066 | 67.0 | 466 | 120 | 5.0 | ... | 5.0 | 0.1 | 219.141353 | 24118979.0 | 19.5 | 0.381 | 6.8 | 1 | 25.666667 | 3.020425 |
5 rows × 21 columns
First, we have seen how important the Status variable is: both the earlier plots and domain knowledge indicate that residents of developed countries live longer. We therefore encode this variable as a True/False indicator column, Status_Developing.
# Guard against re-running the cell: "Status" no longer exists once it has been encoded,
# and calling get_dummies again would raise a KeyError.
if "Status" in dataframe.columns:
    dataframe = pd.get_dummies(
        dataframe,
        columns=["Status"],
        drop_first=True
    )
dataframe["Status_Developing"] = dataframe["Status_Developing"].astype(int)
Next, we look at immunization coverage per country. Since "Diphtheria", "Polio" and "Hepatitis B" all express immunization coverage as a percentage, we can average them into a single country-level index.
dataframe["immunization_index"] = (
dataframe["Hepatitis B"] +
dataframe["Polio"] +
dataframe["Diphtheria"]
) / 3
The earlier plots also suggested that several variables are better represented through a logarithmic transformation. The strongest candidates, because their distributions are heavily right-skewed (most observations cluster on the left), are "GDP", "infant deaths" and "HIV/AIDS".
features_log = ["GDP","HIV/AIDS","infant deaths"]
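for feature in features_log:
    # Use the loop variable here. `col` still holds "thinness 10-19 years" from the
    # earlier cells, which is why the outputs shown below contain "log_thinness 10-19 years".
    dataframe[f"log_{feature}"] = np.log1p(dataframe[feature])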
dataframe.head()
| | Country | Year | Life expectancy | infant deaths | Alcohol | percentage expenditure | Hepatitis B | Measles | under-five deaths | Polio | ... | HIV/AIDS | GDP | Population | thinness 10-19 years | Income composition of resources | Schooling | Country_wb | Status_Developing | immunization_index | log_thinness 10-19 years |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 15 | Afghanistan | 2000 | 54.8 | 88 | 0.02 | 10.424960 | 62.0 | 6532 | 122 | 24.0 | ... | 0.1 | 114.560000 | 29375600.0 | 2.3 | 0.338 | 5.5 | Afghanistan | 1 | 36.666667 | 1.193922 |
| 14 | Afghanistan | 2001 | 55.3 | 88 | 0.02 | 10.574728 | 63.0 | 8762 | 122 | 35.0 | ... | 0.1 | 117.496980 | 29664630.0 | 2.1 | 0.340 | 5.9 | Afghanistan | 1 | 43.666667 | 1.131402 |
| 13 | Afghanistan | 2002 | 56.2 | 88 | 0.02 | 16.887351 | 64.0 | 2486 | 122 | 36.0 | ... | 0.1 | 187.845950 | 21979923.0 | 19.9 | 0.341 | 6.2 | Afghanistan | 1 | 45.333333 | 3.039749 |
| 12 | Afghanistan | 2003 | 56.7 | 87 | 0.02 | 11.089053 | 65.0 | 798 | 122 | 41.0 | ... | 0.1 | 198.728544 | 23648510.0 | 19.7 | 0.373 | 6.5 | Afghanistan | 1 | 49.000000 | 3.030134 |
| 11 | Afghanistan | 2004 | 57.0 | 87 | 0.02 | 15.296066 | 67.0 | 466 | 120 | 5.0 | ... | 0.1 | 219.141353 | 24118979.0 | 19.5 | 0.381 | 6.8 | Afghanistan | 1 | 25.666667 | 3.020425 |
5 rows × 22 columns
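To illustrate why a log transformation helps with right-skewed variables such as those named above, here is a minimal sketch on synthetic data. The series `x` is a hypothetical lognormal sample standing in for a column like GDP; it is not taken from the dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic, heavily right-skewed sample (assumption: stands in for a column such as GDP).
x = pd.Series(rng.lognormal(mean=7.0, sigma=1.5, size=2000))

# log1p computes log(1 + x), which is defined at zero and compresses the long right tail.
x_log = np.log1p(x)

print("skew before:", round(x.skew(), 3))
print("skew after: ", round(x_log.skew(), 3))

# The transformation is invertible, so predictions can be mapped back with expm1.
x_back = np.expm1(x_log)
```

A skewness close to zero after the transformation indicates a roughly symmetric distribution, which is generally easier for a linear model to work with.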
Data preprocessing¶
Before moving on to feature selection, we preprocess the data so that the model can handle it as efficiently as possible, and we split it into three sets:
Training set: the data the model is fitted on. The model makes predictions on it, computes the error, and adjusts the parameters it uses for prediction.
Validation set: the data used after training to see how much the model has learned. We make predictions and compute metrics on it to see how the model behaves on new data, and we compare validation and training metrics to detect overfitting.
Test set: the data the model sees only once training is completely finished. The model has never seen these observations, so they serve as the real measure of its performance.
Before splitting the data into these three sets, we must separate the target variable "Life expectancy" from the predictors so that the model has no access to what it is predicting. We also scale the predictors with the StandardScaler class, which subtracts each feature's mean and divides by its standard deviation, so that all features end up on a common scale and features with large numeric ranges do not dominate.
X = dataframe.drop(["Life expectancy","Country"], axis = 1)
y = dataframe["Life expectancy"]
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_scaled = pd.DataFrame(X_scaled, columns=X.columns, index=X.index)
X_scaled["Status_Developing"] = X["Status_Developing"]
X_scaled.head()
| | Year | infant deaths | Alcohol | percentage expenditure | Hepatitis B | Measles | under-five deaths | Polio | Total expenditure | Diphtheria | HIV/AIDS | GDP | Population | thinness 10-19 years | Income composition of resources | Schooling | Status_Developing | immunization_index | log_thinness 10-19 years |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1.626978 | 0.487289 | -1.259174 | -0.367848 | -0.750392 | 0.358130 | 0.496818 | -2.512290 | 0.909136 | -2.471073 | -0.324177 | -0.593821 | 0.016338 | -0.571880 | -2.012005 | -2.134521 | 1.0 | -2.230649 | -0.447598 |
| 1 | -1.410048 | 0.487289 | -1.259174 | -0.367773 | -0.709743 | 0.551926 | 0.496818 | -2.040647 | 0.748901 | -2.090006 | -0.324177 | -0.593633 | 0.019053 | -0.617262 | -1.999361 | -2.005207 | 1.0 | -1.884723 | -0.536001 |
| 2 | -1.193118 | 0.487289 | -1.259174 | -0.364609 | -0.669094 | 0.006517 | 0.496818 | -1.997770 | 0.732877 | -1.962984 | -0.324177 | -0.589120 | -0.053120 | 3.421734 | -1.993039 | -1.908222 | 1.0 | -1.802359 | 2.162364 |
| 3 | -0.976187 | 0.478843 | -1.259174 | -0.367515 | -0.628445 | -0.140177 | 0.496818 | -1.783387 | 1.157501 | -1.751280 | -0.324177 | -0.588421 | -0.037449 | 3.376352 | -1.790731 | -1.811236 | 1.0 | -1.621160 | 2.148768 |
| 4 | -0.759257 | 0.478843 | -1.259174 | -0.365406 | -0.547146 | -0.169029 | 0.484402 | -3.326947 | 1.145484 | -3.275546 | -0.324177 | -0.587112 | -0.033031 | 3.330970 | -1.740154 | -1.714251 | 1.0 | -2.774246 | 2.135040 |
Inspecting the result, we see that all features have been standardized (the binary Status_Developing column is restored to its original 0/1 values). We can now split the data into train, validation and test sets.
X_train, X_temp, y_train, y_temp = train_test_split(X_scaled, y, test_size=0.3,random_state=42,stratify=X_scaled["Status_Developing"])
X_val,X_test,y_val,y_test = train_test_split(X_temp,y_temp,test_size = 0.5,random_state = 42,stratify=X_temp["Status_Developing"])
We split with the stratify option so that each subset keeps the same proportion of examples with status 1/0, that is, Developing/Developed. The first call holds out 30% of the data and the second call halves that part, which gives a 70% / 15% / 15% train / validation / test split.
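One possible refinement, sketched here under stated assumptions rather than as the notebook's method, is to split first and fit the scaler on the training part only, so that validation and test statistics do not influence the scaling. The frame `X_demo`, the target `y_demo` and the helper `scale` are hypothetical and built from synthetic data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 1000

# Synthetic stand-in for the predictors and target (not the real dataset).
X_demo = pd.DataFrame({
    "GDP": rng.lognormal(8, 1, n),
    "Schooling": rng.normal(12, 3, n),
    "Status_Developing": rng.integers(0, 2, n),
})
y_demo = pd.Series(rng.normal(70, 8, n))

# 70% train, then the remaining 30% is halved into validation and test.
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42,
    stratify=X_demo["Status_Developing"],
)
X_va, X_te, y_va, y_te = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=42,
    stratify=X_tmp["Status_Developing"],
)

# Fit the scaler on the training rows only; the binary column is left unscaled.
num_cols = ["GDP", "Schooling"]
scaler = StandardScaler().fit(X_tr[num_cols])

def scale(part):
    """Apply the training-set mean and standard deviation to any subset."""
    out = part.copy()
    out[num_cols] = scaler.transform(part[num_cols])
    return out

X_tr_s, X_va_s, X_te_s = scale(X_tr), scale(X_va), scale(X_te)
print(len(X_tr_s), len(X_va_s), len(X_te_s))
```

With this ordering, only the training columns have exactly zero mean and unit variance; the validation and test columns are close to that but not identical, which is expected.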
Feature Selection and model evaluation¶
Feature selection is the core of the whole process carried out on this dataset. Here we choose the variables the model will use for prediction, based on the conclusions drawn in the previous steps. The goal is to keep only the relevant features that carry information for the model and to discard irrelevant or highly correlated ones.
Correlation matrix¶
plt.figure(figsize=(14, 10))
numeric_data = dataframe.select_dtypes(include=['float64', 'int64'])
correlation_matrix = numeric_data.corr()
sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()
Based on the correlation matrix and the conclusions from the earlier plots, we exclude the following variables from feature selection:
percentage expenditure: strongly correlated with GDP, which indicates multicollinearity. From domain knowledge we also expect countries with a high GDP to have a higher life expectancy, because their citizens live more comfortably and, above all, have a better healthcare system.
under-five deaths: its correlation with infant deaths is 1, which means certain multicollinearity. Both variables describe nearly the same thing (most deaths of children under five occur in infancy), and keeping both would destabilize the model weights.
Population: based on the plots and the correlation matrix (Life expectancy ~ Population = -0.03), there is practically no correlation.
Income composition of resources: almost certainly collinear with Schooling. Either of the two could be kept; we choose Schooling because it is easier to interpret.
Country: it has too many unique values. One-hot encoding would add too many columns and complicate the model.
Year: the dataset does not span enough years to capture a clear per-country trend.
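The pairs flagged in the list above were read off the heatmap. As a complementary check, highly correlated pairs can be listed programmatically. This is a minimal sketch: the helper `high_corr_pairs`, the 0.9 threshold and the `demo` frame are assumptions introduced for illustration.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df, threshold=0.9):
    """Return variable pairs whose absolute Pearson correlation exceeds the threshold."""
    corr = df.corr(numeric_only=True).abs()
    # Keep only the strict upper triangle so that each pair appears once
    # and the diagonal (always 1) is ignored.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Synthetic example: "a_copy" is almost a rescaled copy of "a", "noise" is unrelated.
rng = np.random.default_rng(1)
a = rng.normal(size=500)
demo = pd.DataFrame({
    "a": a,
    "a_copy": a * 1.2 + rng.normal(scale=0.05, size=500),
    "noise": rng.normal(size=500),
})

result = high_corr_pairs(demo)
print(result)
```

Applied to the numeric columns of the dataset, a listing like this would make it easy to confirm which predictors overlap before deciding which one of each pair to drop.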
Forward-selection¶
We now apply forward selection. The idea is to start from a model with a single variable that we believe explains the largest share of the variability of "Life expectancy", then add variables one at a time, tracking the predictions of each model and comparing them using RMSE, R^2 and adjusted R^2. The candidate predictors come exclusively from the variables that the EDA showed to describe the target well and from the Feature Engineering section.
The first variable in our model is GDP, since the plots showed a clear relationship between a country's GDP and its life expectancy.
model = LinearRegression()
features =[
"GDP",
"Schooling",
"infant deaths",
"thinness 10-19 years",
"Status_Developing",
"Alcohol",
"immunization_index",
"HIV/AIDS"
]
target = "Life expectancy"
Since forward selection is an iterative process, we build a list of the variables we want to include and observe the improvement as each one is added. The variables considered are:
GDP: moderately correlated with Life expectancy. From domain knowledge it is a logical choice: countries with a high GDP tend to have high living standards, and it is reasonable to assume that they invest heavily in healthcare.
Schooling: strongly correlated with Life expectancy. The plots also made it clear that developed countries have a high level of education, which goes together with a high life expectancy. Beyond giving individuals more options, more education exposes them to more viewpoints and raises awareness of the importance of regular check-ups and of what early symptoms might indicate.
infant deaths: although the correlation is not strong, the variable is worth including because it usually points to problems with the healthcare system, medical check-ups and newborn care, as well as the possible presence of certain diseases or epidemics.
thinness 10-19 years: paired with Status, this variable showed that developed countries have a low prevalence of thinness among adolescents, which points to adequate nutrition, availability of food and, for these countries, awareness of the importance of a proper diet.
Status_Developing: a categorical variable that we know, both empirically (from a series of plots) and from domain knowledge, to have a strong influence on the target. As with GDP, developed countries also tend to have developed healthcare, greater awareness of healthy living, and so on.
Alcohol: although the plots showed that developed countries have high alcohol consumption, it is plausible that these countries offset this factor with a strong healthcare system and perhaps promote norms of safer drinking. Another possible explanation is that people in developed countries drink more often but in smaller quantities, particularly since it is commonly claimed that a glass of wine in the evening after work can even bring health benefits.
immunization_index: describes how strong a country's immunization coverage is. It reflects the level of health awareness, particularly since some people are convinced that vaccination achieves nothing and instead serves government agencies to alter people's RNA, implant nanochips and similar absurdities. At the other end of the spectrum, it can reflect poverty, most often in the least developed countries (unfortunately most often on the African continent), which barely have vaccination protocols and where routine check-ups are very rare.
HIV/AIDS: has a solid correlation with Life expectancy. From domain knowledge it is again a sensible choice, since a low number of reported HIV cases points to a normal level of social awareness and individual behavior, which are closely related to life expectancy.
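The notebook adds the candidates in a fixed, hand-chosen order. For comparison, a greedy variant picks at every step the candidate that lowers validation RMSE the most and stops when the gain becomes too small. The following is a minimal sketch on synthetic data; the function `greedy_forward_selection`, the `min_gain` parameter and the columns `x1`, `x2`, `x3` are hypothetical and not part of the original analysis.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def greedy_forward_selection(X_tr, y_tr, X_va, y_va, candidates, min_gain=0.05):
    """Add, one at a time, the feature that reduces validation RMSE the most."""
    selected, best_rmse = [], np.inf
    remaining = list(candidates)
    while remaining:
        scores = {}
        for f in remaining:
            cols = selected + [f]
            m = LinearRegression().fit(X_tr[cols], y_tr)
            scores[f] = np.sqrt(mean_squared_error(y_va, m.predict(X_va[cols])))
        best = min(scores, key=scores.get)
        # Stop when the best remaining candidate no longer helps enough.
        if best_rmse - scores[best] < min_gain:
            break
        selected.append(best)
        remaining.remove(best)
        best_rmse = scores[best]
    return selected, best_rmse

# Synthetic data: the target depends on x1 and x2, while x3 is pure noise.
rng = np.random.default_rng(0)
n = 600
X = pd.DataFrame({
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
    "x3": rng.normal(size=n),
})
y = 3 * X["x1"] + X["x2"] + rng.normal(scale=0.5, size=n)

X_tr, X_va = X.iloc[:400], X.iloc[400:]
y_tr, y_va = y.iloc[:400], y.iloc[400:]

selected, rmse = greedy_forward_selection(X_tr, y_tr, X_va, y_va, ["x3", "x2", "x1"])
print(selected, round(rmse, 3))
```

The fixed-order loop used in the notebook has the advantage of following the domain reasoning given above; the greedy variant removes the dependence on the order in which candidates are listed.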
selected_features = []
print("FORWARD SELECTION REZULTATI\n")
def adjR2(r2, n, p):
    # adjusted R^2 for n observations and p predictors
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    return adj_r2

for feature in features:
    selected_features.append(feature)
    model.fit(X_train[selected_features], y_train)
    y_train_pred = model.predict(X_train[selected_features])
    y_val_pred = model.predict(X_val[selected_features])
    train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
    val_rmse = np.sqrt(mean_squared_error(y_val, y_val_pred))
    train_r2 = r2_score(y_train, y_train_pred)
    val_r2 = r2_score(y_val, y_val_pred)
    n = X_val[selected_features].shape[0]
    p = X_val[selected_features].shape[1]
    adj_r2_val = adjR2(val_r2, n, p)
    print("Features:", selected_features)
    print("Train RMSE:", round(train_rmse, 3))
    print("Val RMSE:", round(val_rmse, 3))
    print("Train R2:", round(train_r2, 3))
    print("Val R2:", round(val_r2, 3))
    print("Adjusted Val R2:", round(adj_r2_val, 3))
    print("-" * 40)
FORWARD SELECTION REZULTATI

Features: ['GDP']
Train RMSE: 7.951
Val RMSE: 7.862
Train R2: 0.302
Val R2: 0.306
Adjusted Val R2: 0.305
----------------------------------------
Features: ['GDP', 'Schooling']
Train RMSE: 6.065
Val RMSE: 5.717
Train R2: 0.594
Val R2: 0.633
Adjusted Val R2: 0.631
----------------------------------------
Features: ['GDP', 'Schooling', 'infant deaths']
Train RMSE: 6.056
Val RMSE: 5.707
Train R2: 0.595
Val R2: 0.634
Adjusted Val R2: 0.632
----------------------------------------
Features: ['GDP', 'Schooling', 'infant deaths', 'thinness 10-19 years']
Train RMSE: 5.968
Val RMSE: 5.726
Train R2: 0.607
Val R2: 0.632
Adjusted Val R2: 0.629
----------------------------------------
Features: ['GDP', 'Schooling', 'infant deaths', 'thinness 10-19 years', 'Status_Developing']
Train RMSE: 5.96
Val RMSE: 5.717
Train R2: 0.608
Val R2: 0.633
Adjusted Val R2: 0.629
----------------------------------------
Features: ['GDP', 'Schooling', 'infant deaths', 'thinness 10-19 years', 'Status_Developing', 'Alcohol']
Train RMSE: 5.893
Val RMSE: 5.574
Train R2: 0.616
Val R2: 0.651
Adjusted Val R2: 0.646
----------------------------------------
Features: ['GDP', 'Schooling', 'infant deaths', 'thinness 10-19 years', 'Status_Developing', 'Alcohol', 'immunization_index']
Train RMSE: 5.705
Val RMSE: 5.313
Train R2: 0.641
Val R2: 0.683
Adjusted Val R2: 0.678
----------------------------------------
Features: ['GDP', 'Schooling', 'infant deaths', 'thinness 10-19 years', 'Status_Developing', 'Alcohol', 'immunization_index', 'HIV/AIDS']
Train RMSE: 4.589
Val RMSE: 4.346
Train R2: 0.767
Val R2: 0.788
Adjusted Val R2: 0.784
----------------------------------------
(Note: to see all the results, view the last output as a scrollable element.)
Looking at every iteration of the forward selection, we conclude that the variables infant deaths, thinness 10-19 years and Status_Developing barely improve the model's metrics: validation RMSE does not meaningfully decrease and Adjusted R^2 does not increase when they are added (it stays between 0.629 and 0.632), so with these variables the model stagnates.
We keep all the remaining variables and evaluate the metrics on them.
final_features =[
'GDP',
'Schooling',
'Alcohol',
'immunization_index',
'HIV/AIDS',
]
model.fit(X_train[final_features], y_train)
y_train_pred = model.predict(X_train[final_features])
y_val_pred = model.predict(X_val[final_features])
train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
val_rmse = np.sqrt(mean_squared_error(y_val, y_val_pred))
train_r2 = r2_score(y_train, y_train_pred)
val_r2 = r2_score(y_val, y_val_pred)
n = X_val[final_features].shape[0]
p = X_val[final_features].shape[1]
adj_r2_val = adjR2(val_r2,n,p)
print("Features:", final_features)
print("Train RMSE:", round(train_rmse, 3))
print("Val RMSE:", round(val_rmse, 3))
print("Train R2:", round(train_r2, 3))
print("Val R2:", round(val_r2, 3))
print("Adjusted Val R2:", round(adj_r2_val, 3))
print("-" * 40)
Features: ['GDP', 'Schooling', 'Alcohol', 'immunization_index', 'HIV/AIDS'] Train RMSE: 4.66 Val RMSE: 4.41 Train R2: 0.76 Val R2: 0.782 Adjusted Val R2: 0.779 ----------------------------------------
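For reference, adjusted R2 penalizes R2 for the number of predictors p given n observations: 1 - (1 - R2)(n - 1)/(n - p - 1). The sketch below assumes that the notebook's adjR2 helper follows this standard definition; the function name adjusted_r2 and the example numbers are hypothetical and are not taken from the data set.

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Standard adjusted R^2 for n observations and p predictors."""
    if n - p - 1 <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


# Hypothetical example: the same R^2 with more predictors
# yields a lower adjusted value.
few = adjusted_r2(0.80, n=101, p=5)
many = adjusted_r2(0.80, n=101, p=20)
print(round(few, 4), round(many, 4))
```

This is why a model with fewer predictors can be preferable even when its raw R2 is slightly lower.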
Comparing the metrics of this reduced model with those of the full eight-feature model, we see that the full model is only marginally better (the adjusted R2 on the validation set differs by 0.005: 0.784 versus 0.779). This gain is negligible, so we keep the model with fewer variables, which ensures that only the key predictors remain. The adjusted R2 of 0.779 tells us that the model explains about 78% of the variability of "Life Expectancy" on the validation set, and the validation RMSE of 4.41 years is a solid result. What remains is to check for multicollinearity among the selected variables. At first glance it should not be an issue, but this impression needs to be backed by a calculation, so we compute the VIF for this model.
VIF (Variance Inflation Factor) is a metric that indicates how much the variance of an estimated regression coefficient is inflated by multicollinearity among the independent variables. As a rule of thumb, a VIF below 5 means that the predictor is not strongly correlated with the other predictors in the model.
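To make the definition concrete: the VIF of predictor j equals 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on all the other predictors. The following is a minimal pure-Python sketch of that idea, not the statsmodels implementation; the helper names and the toy columns are hypothetical. Each auxiliary regression here includes an intercept, which plays the role of the constant column added in the notebook code.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def r_squared(target, predictors):
    """R^2 of a least-squares fit of target on predictors plus an intercept."""
    cols = [[1.0] * len(target)] + predictors
    k = len(cols)
    # Normal equations: (Z^T Z) beta = Z^T y
    ztz = [[sum(u * v for u, v in zip(cols[i], cols[j])) for j in range(k)]
           for i in range(k)]
    zty = [sum(u * v for u, v in zip(cols[i], target)) for i in range(k)]
    beta = solve_linear_system(ztz, zty)
    fitted = [sum(beta[i] * cols[i][r] for i in range(k))
              for r in range(len(target))]
    mean_y = sum(target) / len(target)
    ss_res = sum((y - f) ** 2 for y, f in zip(target, fitted))
    ss_tot = sum((y - mean_y) ** 2 for y in target)
    return 1.0 - ss_res / ss_tot


def vif(columns):
    """VIF per column: 1 / (1 - R_j^2), regressing column j on the others."""
    return [1.0 / (1.0 - r_squared(col, columns[:j] + columns[j + 1:]))
            for j, col in enumerate(columns)]


# Hypothetical toy predictors: x1 and x2 are clearly correlated.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
print(vif([x1, x2]))
```

Perfectly collinear columns give R_j^2 = 1 and a division by zero, which corresponds to an infinite VIF; the sketch does not guard against that case.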
X = dataframe[final_features]
X = sm.add_constant(X)
vif_data = pd.DataFrame()
vif_data["Feature"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(vif_data)
| Feature | VIF |
|---|---|
| const | 29.179061 |
| GDP | 1.447015 |
| Schooling | 2.056805 |
| Alcohol | 1.533496 |
| immunization_index | 1.244778 |
| HIV/AIDS | 1.078548 |
Looking at the VIF values of all the independent variables included in the model, we see that none of them is strongly correlated with the others: every predictor has a VIF well below 5 (the value for const belongs to the intercept column and is not interpreted as a sign of multicollinearity). With multicollinearity ruled out and the validation metrics in hand, we consider that the model generalizes adequately for predicting life expectancy, so we can finally examine its behavior on the test set.
y_test_pred = model.predict(X_test[final_features])
test_rmse = np.sqrt(mean_squared_error(y_test,y_test_pred))
test_mae = mean_absolute_error(y_test,y_test_pred)
test_r2 = r2_score(y_test,y_test_pred)
print("TEST RMSE :",test_rmse)
print("TEST MAE : ",test_mae)
print("TEST R2",test_r2)
TEST RMSE : 4.144140131371196 TEST MAE : 3.247733999905864 TEST R2 0.8018094752106939
On the test set the R2 value is close to the one obtained on the validation set (0.80 versus 0.78), which further indicates that the model is not overfitting. The conclusion is the same: the model explains about 80% of the variability of the target variable "Life Expectancy", that is, 80% of its variation around the mean. The remaining 20% may come from factors we did not include, but it more likely stems from the considerable noise present in the data set. The MAE tells us that the model's predictions are off by about 3.2 years on average, which at the country level is a solid result for such a simple model.
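The three test metrics discussed above can be written out directly. This is a small sketch with hypothetical values (not taken from the data set) that shows how a single large error raises RMSE more than MAE, and how R2 compares the squared errors with the variation around the mean.

```python
from math import sqrt


def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large errors more strongly."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error: the average size of the error, in years."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Share of the variation around the mean explained by the predictions."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical life expectancies (years); only the last prediction is off.
y_true = [60.0, 65.0, 70.0, 75.0]
y_pred = [60.0, 65.0, 70.0, 79.0]
print(rmse(y_true, y_pred), mae(y_true, y_pred), r2(y_true, y_pred))
```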
Implementation of other models and comparison¶
The main question posed in this seminar paper is: "How can we use socio-economic factors to build a model that predicts a country's life expectancy?" We tried to answer it with a linear regression model that predicts the continuous variable "Life Expectancy". Since this is a regression problem, we can also apply models such as Ridge/Lasso Regression, Random Forest, XGBoost, and so on. This lets us compare our model directly with the other models and arrive at new observations and relationships that we may have missed.
Ridge and Lasso Regularization¶
The first models we consider, Ridge and Lasso Regression, apply regularization to linear regression. They modify ordinary linear regression by adding a penalty term controlled by the parameter alpha (lambda), which shrinks the weights of the predictors (or zeroes them out entirely) in order to improve generalization. The two methods differ in that Ridge (L2 penalty) shrinks coefficients toward zero without making them exactly zero, whereas Lasso (L1 penalty) can set coefficients exactly to zero and thereby remove the corresponding predictors from the equation.
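The difference between the two penalties is easiest to see in the idealized case of an orthonormal design (X^T X = I), where both estimators have a closed form in terms of the ordinary least-squares coefficient b. Assuming the objectives (1/2)||y - Xw||^2 + (alpha/2)||w||^2 for Ridge and (1/2)||y - Xw||^2 + alpha||w||_1 for Lasso, Ridge gives b / (1 + alpha) and Lasso gives the soft-thresholded value sign(b) * max(|b| - alpha, 0). This scaling is an assumption made for illustration and differs from the one scikit-learn uses; the function names and coefficients below are hypothetical.

```python
def ridge_coefficient(b_ols: float, alpha: float) -> float:
    """Ridge under an orthonormal design: proportional shrinkage toward zero."""
    return b_ols / (1.0 + alpha)


def lasso_coefficient(b_ols: float, alpha: float) -> float:
    """Lasso under an orthonormal design: soft-thresholding."""
    magnitude = max(abs(b_ols) - alpha, 0.0)
    return magnitude if b_ols >= 0 else -magnitude


# Hypothetical least-squares coefficients, one of them small.
ols_coefs = [3.0, -1.5, 0.4]
alpha = 0.5
ridge_coefs = [ridge_coefficient(b, alpha) for b in ols_coefs]
lasso_coefs = [lasso_coefficient(b, alpha) for b in ols_coefs]
print(ridge_coefs)
print(lasso_coefs)
```

In this sketch the small coefficient survives under Ridge in shrunken form, while Lasso removes it entirely, which is the behavior described in the paragraph above.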
Before building the Ridge and Lasso models, we determine the best value of the parameter alpha for each of them using cross-validation.
alphas = [0.001, 0.01, 0.1, 1, 10, 100]
ridge_cv = RidgeCV(alphas=alphas)
ridge_cv.fit(X_train, y_train)
best_ridge_alpha = ridge_cv.alpha_
print("Best alpha Ridge:", best_ridge_alpha)
lasso_cv = LassoCV(alphas=alphas, max_iter=10000)
lasso_cv.fit(X_train, y_train)
best_lasso_alpha = lasso_cv.alpha_
print("Best alpha Lasso:", best_lasso_alpha)
Best alpha Ridge: 1.0 Best alpha Lasso: 0.001
With the best alpha values found above, we train both models and report their RMSE and MAE on the test set.
ridge = Ridge(alpha=best_ridge_alpha, max_iter=10000)
ridge.fit(X_train, y_train)
y_pred_ridge = ridge.predict(X_test)
print("Ridge RMSE:",np.sqrt(mean_squared_error(y_test, y_pred_ridge)))
print("Ridge MAE:", mean_absolute_error(y_test, y_pred_ridge))
Ridge RMSE: 3.3521749529932077 Ridge MAE: 2.468885417275904
lasso = Lasso(alpha=best_lasso_alpha, max_iter=10000)
lasso.fit(X_train, y_train)
y_pred_lasso = lasso.predict(X_test)
print("Lasso RMSE:",np.sqrt(mean_squared_error(y_test, y_pred_lasso)))
print("Lasso MAE:", mean_absolute_error(y_test, y_pred_lasso))
Lasso RMSE: 3.3561175428908134 Lasso MAE: 2.472267776463046
The next two models we consider are Random Forest and XGBoost.
Random Forest is an ensemble method that combines a large number of decision trees. Each tree is trained on a random subset of the data and a random subset of the variables, which reduces the variance of the model. The final prediction is obtained by averaging (for regression) or by voting (for classification). The model is fairly robust to overfitting and works well even when the relationship between the variables is nonlinear.
XGBoost (Extreme Gradient Boosting) is an optimized implementation of the gradient boosting algorithm and is also built on decision trees. The idea is to build the model sequentially, so that each new tree tries to correct the errors of the previous trees. It uses regularization (L1 and L2) to reduce model complexity and prevent overfitting, to which decision tree algorithms are generally prone. Because of its efficiency and performance, XGBoost is often used in machine learning competitions and in real-world problems.
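The sequential idea, where each new tree corrects the errors of the previous ones, can be illustrated with a minimal sketch of plain gradient boosting for squared error that uses one-split trees (stumps) on a single feature. This is not XGBoost itself: it omits XGBoost's regularized objective, subsampling and second-order information, and all names and data below are hypothetical.

```python
def fit_stump(x, residuals):
    """Fit a one-split regression tree to the residuals by least squares."""
    best = None
    # Candidate thresholds exclude the largest value so both sides are non-empty.
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean


def boost(x, y, n_rounds=50, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(y) / len(y)
    prediction = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, prediction)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        prediction = [pi + learning_rate * stump(xi)
                      for pi, xi in zip(prediction, x)]
    return lambda xi: base + learning_rate * sum(s(xi) for s in stumps)


def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical nonlinear relationship: y = x^2.
x = [float(i) for i in range(1, 9)]
y = [xi ** 2 for xi in x]

mean_y = sum(y) / len(y)
baseline_mse = mse(y, [mean_y] * len(y))
model_5 = boost(x, y, n_rounds=5)
model_50 = boost(x, y, n_rounds=50)
mse_5 = mse(y, [model_5(xi) for xi in x])
mse_50 = mse(y, [model_50(xi) for xi in x])
print(baseline_mse, mse_5, mse_50)
```

The training error shrinks as rounds are added, which is also why boosted models rely on a learning rate, a depth limit and regularization to avoid overfitting.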
rf = RandomForestRegressor(n_estimators=200,max_depth=None,random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
print("Random forest RMSE:",np.sqrt(mean_squared_error(y_test, y_pred_rf)))
print("Random forest MAE:", mean_absolute_error(y_test, y_pred_rf))
Random forest RMSE: 1.6068533257307265 Random forest MAE: 1.1079782608695632
xgb = XGBRegressor(n_estimators=200,learning_rate=0.05,max_depth=4,subsample=0.8,colsample_bytree=0.8,random_state=42)
xgb.fit(X_train, y_train)
y_pred_xgb = xgb.predict(X_test)
print("XGBoost RMSE:",np.sqrt(mean_squared_error(y_test, y_pred_xgb)))
print("XGBoost MAE:", mean_absolute_error(y_test, y_pred_xgb))
XGBoost RMSE: 1.7944843679286826 XGBoost MAE: 1.324241596283029
Model comparison¶
Now that all the models have been built and trained, we can compare the metrics obtained on the test set and see how our linear regression model fares against the others.
| Model | RMSE | MAE |
|---|---|---|
| Linear Regression | 4.14 | 3.25 |
| Ridge | 3.35 | 2.47 |
| Lasso | 3.36 | 2.47 |
| Random Forest | 1.61 | 1.11 |
| XGBoost | 1.79 | 1.32 |
Using the table, we can first compare our linear model with the two models that use regularization (Ridge and Lasso). Our model's MAE is higher by ≈ 0.8 years, which is a solid result bearing in mind that Ridge and Lasso are fitted on the full feature set in X_train and use the penalty to shrink (or, in the case of Lasso, eliminate) the coefficients of less useful predictors, while our model uses only five predictors.
On the other hand, the Random Forest and XGBoost models show significant gains: both MAE and RMSE are very low for the two models, indicating highly successful prediction of the Life Expectancy variable. This points to a key advantage of models based on decision trees: the ability to model nonlinear relationships between variables. Random Forest achieves this through bagging and by combining many trees, while XGBoost uses gradient boosting, where each new tree sequentially corrects the errors of the previous ones.
Although the tree-based models have the best predictive power here, every model has its strengths and weaknesses, and both the results and the suitability of a model change from problem to problem. It is also important to stress that the models behave differently depending on whether the data are normalized, whether the relationships are linear or nonlinear, how many resources are available for solving the problem, and so on.
Conclusion¶
This seminar paper was built around applying the methods of Data Science and Machine Learning through all the steps (exploratory analysis, data cleaning, data preparation, feature engineering, feature selection, model building, implementation and comparison of models) in order to predict the values of life expectancy (Life expectancy) as well as possible.
Through data analysis we arrived at the predictors finally selected for the model: the country's GDP, the country's level of education (Schooling), alcohol consumption at the country level, the logarithm of the prevalence of HIV/AIDS, and immunization_index, a variable obtained through feature engineering.
Applying unregularized multiple linear regression to the target variable Life expectancy yielded a model that, with this limited number of variables, describes 80% of the variability of life expectancy.
Comparing this model with the others, we can see that the models that use regularization (Ridge and Lasso) lagged behind the tree-based models (Random Forest and XGBoost), which described the variability of life expectancy best and gave the lowest RMSE and MAE values.
The project could be improved by experimenting with different approaches in the feature engineering phase, where certain variables could be converted into categorical ones (HIV/AIDS, Schooling and similar). It would be especially helpful if the data set contained more precise values for the BMI variable; one idea would be to import those values from another data set. It would also be useful to detect outliers using the Z-score, and the Support Vector Regression (SVR) model could additionally be applied and tested.
This paper shows that, by properly following the basic principles of Data Science and applying the methods it prescribes, it is possible to construct accurate models for predicting the life expectancy of a population. The model we designed, a linear regression model, proved to be a fairly interpretable and efficient approach to this problem, which with a small number of predictors managed to explain 4/5 of the variability of life expectancy.